Different CTAs Accessing the Same Shared Memory Address on RTX 5090 — Is This Expected?

Hi, I’m debugging a CUTLASS kernel on an RTX 5090 (which does not support clusters), and I noticed something strange.

According to my understanding, shared memory is allocated per-CTA, so different thread blocks should never access the same shared memory address. However, in my debug prints, I see different CTAs reporting the same shared memory pointer value. For example, logs look like this:

THR (0,0,0) BLK (6,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
THR (0,0,0) BLK (12,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
THR (0,0,0) BLK (4,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
THR (0,0,0) BLK (10,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
THR (0,0,0) BLK (7,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
THR (0,0,0) BLK (8,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
...
THR (0,0,0) BLK (49,0,0) TMACRD (0,128,0,0,0) SMEMADDR (0x7f8700008400)
THR (0,0,0) BLK (56,0,0) TMACRD (0,1024,0,0,0) SMEMADDR (0x7f8700008400)
THR (0,0,0) BLK (47,0,0) TMACRD (0,1920,0,0,0) SMEMADDR (0x7f8700008400)
THR (0,0,0) BLK (57,0,0) TMACRD (0,1152,0,0,0) SMEMADDR (0x7f8700008400)
THR (0,0,0) BLK (15,0,0) TMACRD (0,1920,0,0,0) SMEMADDR (0x7f8700008400)
THR (0,0,0) BLK (48,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700008400)
...

Different CTAs show the same shared memory address, even though the GPU (RTX 5090) does not support clusters.

After this happens, the kernel hits an illegal instruction error when executing CUTLASS TMA-related operations (cp.async.bulk.tensor.2d, etc.).

So my questions are:

  1. Is it expected that different CTAs print the same shared memory pointer value?
    My impression is that shared memory is scoped to a single CTA, so this result seems wrong.
  2. Could this incorrect shared memory address be the reason for the illegal instruction exception?
    Since the kernel uses TMA instructions, I want to confirm whether incorrect smem addressing could trigger this.
  3. Is there any known issue with smem pointers or smem addressing when debug-printing in CUTLASS kernels?

Any insights would be appreciated. Thanks!

The shared address is a pointer into a window; the window (base) is assigned per CTA. The same pointer value in two CTAs does not refer to the same physical storage.

You can run a fairly simple experiment yourself. Declare a __shared__ variable in a kernel, take the address of that variable, and print it out. It will be the same across all thread blocks. That is expected behavior.

# cat t432.cu
#include <cstdio>

__global__ void k(){

  __shared__ int a;
  printf("%lu\n", (unsigned long long)&a);
}

int main(){

  k<<<4,1>>>();
  cudaDeviceSynchronize();
}
# nvcc -o t432 t432.cu
# compute-sanitizer ./t432
========= COMPUTE-SANITIZER
140118023208960
140118023208960
140118023208960
140118023208960
========= ERROR SUMMARY: 0 errors
#
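If you want to look at the window itself rather than the generic pointer, you can convert the address to a shared state-space offset. Here is a minimal sketch of that idea, using the __cvta_generic_to_shared() intrinsic (the exact offset value printed is implementation-specific):

#include <cstdio>

__global__ void k2(){

  __shared__ int a;
  // Convert the generic-space pointer to an offset within this CTA's
  // shared memory window; every block should report the same small offset.
  unsigned long long offset = __cvta_generic_to_shared(&a);
  printf("block %d: shared window offset %llu\n", blockIdx.x, offset);
}

int main(){

  k2<<<4,1>>>();
  cudaDeviceSynchronize();
}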

Thank you for your reply. If the starting shared memory address is the same for all thread blocks, how does each block get its own shared memory without being affected by other blocks?

  1. Can it be understood as just a virtual address, with the actual physical addresses being different? If so, how can we determine the actual physical address?
  2. Moreover, even if they are supposed to be the same, why does the output I printed appear in two parts, with two different shared memory addresses?

How did you arrive at the conclusion that RTX 5090 does not support clusters? RTX 5090 is a CC 12.0 device, which according to the programming guide supports clusters.

What I meant is that the RTX 5090 does not support cluster sub-byte TMA; my mistake earlier was not stating that clearly.

Thanks for the clarification.

It is no surprise that the shared memory address is the same for all thread blocks. This behavior is not exclusive to shared memory. For example, the same SASS code is executed by all threads and blocks, yet register R42 of thread 1 is different from register R42 of thread 2. The hardware obviously keeps track of which block / warp is currently executing and adjusts the hardware accesses accordingly, but it is not documented how this is achieved.

Without showing code, it will be difficult for others to find out why in your case the addresses appear to be different, and why you get an illegal instruction.

You are assuming a physically unified memory with unique physical addresses.

In practice each memory has address lines and select lines (one or several ANDed ones).

The global memory, the shared memory, the L2 cache, the L1 cache - all are different memories. Even each GDDR RAM chip can be seen as one or several separate memories.

The notion of a physical address space is from the view of a processor or a bus; those physical addresses can in turn be treated as virtual by the electronics behind them.

The shared memory definitely has some offset applied to each access, as several blocks can run on one SM.

The physical address of the shared memory in each SM probably starts at 0?

(but we never know)
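A minimal sketch to see the isolation in practice: each block writes its own id through the identical-looking pointer and always reads back its own value, because the address resolves to that block's own physical shared memory.

#include <cstdio>

__global__ void isolation(){
  __shared__ int a;
  a = blockIdx.x;   // every block writes its own id to the "same" address
  __syncthreads();
  // Each block reads back its own id; the identical generic address
  // resolves to a different physical location per block.
  printf("block %d wrote %d and read back %d\n", blockIdx.x, blockIdx.x, a);
}

int main(){
  isolation<<<8,1>>>();
  cudaDeviceSynchronize();
}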

Out of curiosity, I extended Robert_Crovella’s example to thread block clusters.

#include <cooperative_groups.h>
#include <cstdio>
#include <iostream>

namespace cg = cooperative_groups;

__cluster_dims__(4,1,1)
__global__
void kernel_cluster4(){
    __shared__ int a[1024];

    auto cluster = cg::this_cluster();
    int id = cluster.block_index().x;
    int* id0_a = cluster.map_shared_rank(&a[0], 0);
    int* id1_a = cluster.map_shared_rank(&a[0], 1);
    int* id2_a = cluster.map_shared_rank(&a[0], 2);
    int* id3_a = cluster.map_shared_rank(&a[0], 3);

    printf("block %d, id %d, a %p, id0_a %p, id1_a %p, id2_a %p, id3_a %p, isShared0 %d, isShared1 %d, isShared2 %d, isShared3 %d\n", 
        blockIdx.x, id, &a[0], id0_a, id1_a, id2_a, id3_a,
        __isShared(id0_a), __isShared(id1_a), __isShared(id2_a), __isShared(id3_a));

    cluster.sync();
}

__cluster_dims__(1,1,1)
__global__
void kernel_cluster1(){
    __shared__ int a[1024];

    auto cluster = cg::this_cluster();
    int id = cluster.block_index().x;
    int* id0_a = cluster.map_shared_rank(&a[0], 0);

    printf("block %d, id %d, a %p, id0_a %p\n", 
        blockIdx.x, id, &a[0], id0_a);

    cluster.sync();
}

int main(){
    kernel_cluster1<<<8,1>>>();
    cudaDeviceSynchronize();
    std::cout << "\n";

    kernel_cluster4<<<8,1>>>();
    cudaDeviceSynchronize();
    std::cout << "\n";
} 

Output (sorted)

block 0, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 1, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 2, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 3, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 4, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 5, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 6, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 7, id 0, a 0xe93600000400, id0_a 0xe93600000400




block 0, id 0, a 0xe93600000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 1, isShared1 0, isShared2 0, isShared3 0
block 1, id 1, a 0xe93601000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 1, isShared2 0, isShared3 0
block 2, id 2, a 0xe93602000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 0, isShared2 1, isShared3 0
block 3, id 3, a 0xe93603000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 0, isShared2 0, isShared3 1
block 4, id 0, a 0xe93600000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 1, isShared1 0, isShared2 0, isShared3 0
block 5, id 1, a 0xe93601000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 1, isShared2 0, isShared3 0
block 6, id 2, a 0xe93602000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 0, isShared2 1, isShared3 0
block 7, id 3, a 0xe93603000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 0, isShared2 0, isShared3 1

Blocks within a cluster have different addresses, but the same block rank in different clusters has the same address. In this run, the mapped generic addresses differ by 0x1000000 per cluster rank.

This error comes from the lower-level CUTLASS code, and I believe it is likely a CUTLASS issue. The specific failing instruction is:

cp.async.bulk.tensor.2d.shared::cta.global.mbarrier::complete_tx::bytes.L2::cache_hint

You can see that there is special handling here for sm120, but my understanding is that this instruction should be supported on an RTX 5090. The address I obtained during debugging matches the description in my question and does not look abnormal. So why does it cause an illegal instruction on this device?

#if defined(CUTE_ARCH_TMA_SM90_ENABLED)
    uint64_t gmem_int_desc = reinterpret_cast<uint64_t>(desc_ptr);
    uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr);
    uint32_t smem_int_ptr  = cast_smem_ptr_to_uint(smem_ptr);
    cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr);
#if defined(CUTE_ARCH_TMA_SM120_ENABLED)
    asm volatile (
      "cp.async.bulk.tensor.2d.shared::cta.global.mbarrier::complete_tx::bytes.L2::cache_hint"
      " [%0], [%1, {%3, %4}], [%2], %5;"
      :
      : "r"(smem_int_ptr), "l"(gmem_int_desc), "r"(smem_int_mbar),
        "r"(crd0), "r"(crd1), "l"(cache_hint)
      : "memory");
#else
    asm volatile (
      "cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes.L2::cache_hint"
      " [%0], [%1, {%3, %4}], [%2], %5;"
      :
      : "r"(smem_int_ptr), "l"(gmem_int_desc), "r"(smem_int_mbar),
        "r"(crd0), "r"(crd1), "l"(cache_hint)
      : "memory");
#endif
#endif  // CUTE_ARCH_TMA_SM90_ENABLED
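One thing I can check is which branch my build actually takes. Here is a small compile-time probe I could drop next to this code (a sketch, assuming the same CUTE_ARCH_TMA_* macros as in the excerpt above):

// Compile-time probe: report which TMA variant this translation unit selects.
#if defined(CUTE_ARCH_TMA_SM120_ENABLED)
  #pragma message("TMA path: shared::cta variant (SM120 branch)")
#elif defined(CUTE_ARCH_TMA_SM90_ENABLED)
  #pragma message("TMA path: shared::cluster variant (SM90 branch)")
#else
  #pragma message("TMA path: no CUTE_ARCH_TMA_* macro defined")
#endif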

If you have time and it’s convenient, I can provide my original CUTLASS code along with the setup and test files for you to examine. Thank you very much.

Thank you for the additional experiment; it really helps. So could my address results be the combined effect of multiple clusters?