Thread-local memory address

yingweidadadadadada · November 24, 2021, 4:30am

__global__ void simulated() {
  int tid = blockDim.x * blockIdx.x + threadIdx.x;

  int local[2];
  local[0] = 0;
  local[1] = 1;

  // print thread id and local array address
  // if (tid < 2) {
  //  printf("tid:%d, %p\n", tid, local);
  // }
}

I am trying to understand how cuda manages thread-local memory by using the above code snippet. Ideally, each thread will get its own local array, so the address of local should be different. However, when I print out its address, they are all the same across all threads.

tid:0, 0x7f4ccafffce8
tid:1, 0x7f4ccafffce8

I also tried output the SASS code for this code snippet. It seems to me that the instruction-level code also uses the exact same address for all threads.

        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;      /* 0x00000a00ff017624 */
                                                                                /* 0x000fca00078e00ff */
        /*0010*/                   IADD3 R1, R1, -0x8, RZ ;                     /* 0xfffffff801017810 */
                                                                                /* 0x000fe20007ffe0ff */
        /*0020*/                   IMAD.MOV.U32 R5, RZ, RZ, 0x1 ;               /* 0x00000001ff057424 */
                                                                                /* 0x000fe200078e00ff */
        /*0030*/                   ULDC.64 UR4, c[0x0][0x118] ;                 /* 0x0000460000047ab9 */
                                                                                /* 0x000fe40000000a00 */
        /*0040*/                   IADD3 R2, P0, R1, c[0x0][0x20], RZ ;         /* 0x0000080001027a10 */
                                                                                /* 0x000fc80007f1e0ff */
        /*0050*/                   IADD3.X R3, RZ, c[0x0][0x24], RZ, P0, !PT ;  /* 0x00000900ff037a10 */
                                                                                /* 0x000fca00007fe4ff */
        /*0060*/                   ST.E [R2.64+0x4], R5 ;                       /* 0x0000040502007985 */
                                                                                /* 0x000fe8000c101904 */
        /*0070*/                   ST.E [R2.64], RZ ;                           /* 0x000000ff02007985 */
                                                                                /* 0x000fe2000c101904 */
        /*0080*/                   EXIT ;                                       /* 0x000000000000794d */
                                                                                /* 0x000fea0003800000 */
        /*0090*/                   BRA 0x90;                                    /* 0xfffffff000007947 */

Can anyone share some insights how cuda exactly manage local memory per thread? To compile this code, you probably need to disable all compiler optimization. Otherwise, many instructions will be skipped.

njuffa · November 24, 2021, 11:09am

This may be due to code optimization. The code is never writing to the array elements, it is really using just two constants, 1 and 2. These are used by all the threads, so can be shared. What happens if you try this:

  local[0] = tid + 0;
  local[1] = tid + 1;

Robert_Crovella · November 24, 2021, 1:59pm

This is expected behavior for addresses in local space, which is private to each thread. The local space is represented as a “window” in the global space, and the variables in local space are each at the same location (when considered across threads), relative to the window base offset, in the global space. The LD/ST unit knows when an address will target the local space, and creates a per-thread offset to the base address, to create an appropriate unique per-thread location for the local address.

A related idea is that when threads across a warp access the “same” variable in local space, there will be a general tendency for these accesses, when resolved to device memory, to be coalesced. The address offsetting provided by the LD/ST unit is part of this behavior.

yingweidadadadadada · November 24, 2021, 4:56pm

Thanks @Robert_Crovella !

Another followup is more about how local memory is managed internally.

I understand that each thread can obtain a max of 512KB local memory. In CUDA, each kernel can launch many threads (e.g., 2^30). In this case, if a thread requests 128B array, the total local memory size will be 128GB, which goes out of GPU device memory capacity easily. I guess CUDA definitely dynamically manages the local memory during runtime. I am curious about how that is exactly implemented?

Can you provide me some resources to understand more how that is done in CUDA? By looking the SASS, it is hard to understand the internal.

Robert_Crovella · November 24, 2021, 5:19pm

The local memory complement that the GPU must maintain is not actually a function of the maximum thread complement of a kernel launch (e.g. 2^30) but instead a function of the maximum thread-carrying capacity of the GPU. This is a much smaller number, on the order of several hundred thousand, maximum, based on current architectures (that number grows over time). The maximum thread carrying capacity of the GPU is the number of SMs times the maximum threads per SM. A100 for example has 108 SMs each of which can carry 2048 threads. So that is a maximum thread complement of 2048x108 = 221,184 threads.

However, even that number, if multiplied by 512KB per thread, would yield ~110GB, which is larger than the 80GB available on the larger A100 variant, currently. So the conclusion we reach is that the 512KB number is an upper bound, and in fact the actual possible local memory complement per thread may be lower, and is something that may not be discoverable until runtime (since the compiler does not know what GPU you will be running on, and obviously the ratio of GPU memory to thread complement matters here).

I’m not sure any detailed implementation is specified, but the above description should provide a suitable mental model for the CUDA programmer.

As you are perhaps starting to discover, questions like this are already answered in one forum post or another. For example here is what I consider to be the “canonical” description of the local memory calculation. Therefore, you may find additional info via research, and of course the cuda documentation is a resource as well. I recently just went through the ptx guide just searching on the word “local” to answer a question I had, recently.

system · December 8, 2021, 5:19pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.