The local memory complement that the GPU must maintain is not actually a function of the maximum thread complement of a kernel launch (e.g. 2^30), but instead a function of the maximum thread-carrying capacity of the GPU. This is a much smaller number, on the order of several hundred thousand at most on current architectures (and it grows over time). The maximum thread-carrying capacity of the GPU is the number of SMs times the maximum threads per SM. An A100, for example, has 108 SMs, each of which can carry 2048 threads, for a maximum thread complement of 2048 x 108 = 221,184 threads.
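If you want to compute that capacity for whatever GPU you are actually running on, a minimal host-side sketch (assuming device 0; the `cudaDeviceProp` fields used here are standard) would be:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the GPU's maximum thread-carrying capacity: SM count times
// the maximum resident threads per SM. On an A100 this works out to
// 108 x 2048 = 221,184.
int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    int capacity = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("SMs: %d, max threads/SM: %d, carrying capacity: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, capacity);
    return 0;
}
```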
However, even that number, multiplied by 512KB per thread, would yield ~110GB, which is larger than the 80GB available on the largest A100 variant currently. So the conclusion we reach is that the 512KB number is an upper bound; the actual possible local memory complement per thread may be lower, and may not be discoverable until runtime, since the compiler does not know what GPU you will be running on, and obviously the ratio of GPU memory to thread complement matters here.
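That runtime bound can be estimated the same way. The sketch below divides total device memory by the thread-carrying capacity and compares the result against the documented 512KB architectural maximum; the division is just an illustration of the ratio discussed above, not a documented formula for what the hardware actually provisions:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Estimate the per-thread local memory ceiling on this GPU by
// dividing total device memory across the maximum resident threads,
// then clamping to the 512KB documented upper bound.
int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    size_t capacity  = (size_t)prop.multiProcessorCount *
                       prop.maxThreadsPerMultiProcessor;
    size_t perThread = prop.totalGlobalMem / capacity;  // if all memory were local
    size_t archMax   = 512 * 1024;                      // documented per-thread maximum
    printf("memory-limited bound: %zu KB/thread, architectural bound: %zu KB/thread\n",
           perThread / 1024, archMax / 1024);
    printf("effective ceiling: %zu KB/thread\n",
           (perThread < archMax ? perThread : archMax) / 1024);
    return 0;
}
```

On an 80GB A100, the memory-limited bound (~360KB/thread, before accounting for anything else resident in device memory) is what binds, consistent with the arithmetic above.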
I’m not sure any detailed implementation is specified, but the above description should provide a suitable mental model for the CUDA programmer.
As you are perhaps starting to discover, questions like this are often already answered in one forum post or another. For example, here is what I consider to be the “canonical” description of the local memory calculation. Therefore, you may find additional info via research, and of course the CUDA documentation is a resource as well. I recently went through the PTX guide, just searching on the word “local”, to answer a question I had.