The local memory complement that the GPU must maintain is not actually a function of the maximum thread complement of a kernel launch (e.g. 2^30), but instead a function of the maximum thread-carrying capacity of the GPU. This is a much smaller number, on the order of several hundred thousand at most on current architectures (and it grows over time). The maximum thread-carrying capacity of the GPU is the number of SMs times the maximum threads per SM. An A100, for example, has 108 SMs, each of which can carry 2048 threads, for a maximum thread complement of 2048 x 108 = 221,184 threads.
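If you want to compute that capacity for whatever GPU you are actually running on, a minimal host-side sketch (assuming device 0; the `cudaDeviceProp` fields used here are standard) would be:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the GPU's maximum thread-carrying capacity: SM count times
// the maximum resident threads per SM. On an A100 this works out to
// 108 x 2048 = 221,184.
int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    int capacity = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("SMs: %d, max threads/SM: %d, carrying capacity: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, capacity);
    return 0;
}
```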
However, even that number, multiplied by 512KB per thread, would yield ~110GB, which is larger than the 80GB available on the largest A100 variant currently. So the conclusion we reach is that the 512KB number is an upper bound; the actual possible local memory complement per thread may be lower, and may not be discoverable until runtime, since the compiler does not know what GPU you will be running on, and obviously the ratio of GPU memory to thread complement matters here.
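That runtime bound can be estimated the same way. The sketch below divides total device memory by the thread-carrying capacity and compares the result against the documented 512KB architectural maximum; the division is just an illustration of the ratio discussed above, not a documented formula for what the hardware actually provisions:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Estimate the per-thread local memory ceiling on this GPU by
// dividing total device memory across the maximum resident threads,
// then clamping to the 512KB documented upper bound.
int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    size_t capacity  = (size_t)prop.multiProcessorCount *
                       prop.maxThreadsPerMultiProcessor;
    size_t perThread = prop.totalGlobalMem / capacity;  // if all memory were local
    size_t archMax   = 512 * 1024;                      // documented per-thread maximum
    printf("memory-limited bound: %zu KB/thread, architectural bound: %zu KB/thread\n",
           perThread / 1024, archMax / 1024);
    printf("effective ceiling: %zu KB/thread\n",
           (perThread < archMax ? perThread : archMax) / 1024);
    return 0;
}
```

On an 80GB A100, the memory-limited bound (~360KB/thread, before accounting for anything else resident in device memory) is what binds, consistent with the arithmetic above.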
I’m not sure any detailed implementation is specified, but the above description should provide a suitable mental model for the CUDA programmer.
As you are perhaps starting to discover, questions like this are often already answered in one forum post or another. For example, here is what I consider to be the “canonical” description of the local memory calculation. Therefore, you may find additional info via research, and of course the CUDA documentation is a resource as well. I recently went through the PTX guide, just searching on the word “local”, to answer a question I had.