Different CTAs Accessing the Same Shared Memory Address on RTX 5090 — Is This Expected?

Hi, I’m debugging a CUTLASS kernel on an RTX 5090 (which does not support clusters), and I noticed something strange.

According to my understanding, shared memory is allocated per-CTA, so different thread blocks should never access the same shared memory address. However, in my debug prints, I see different CTAs reporting the same shared memory pointer value. For example, logs look like this:

THR (0,0,0) BLK (6,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
THR (0,0,0) BLK (12,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
THR (0,0,0) BLK (4,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
THR (0,0,0) BLK (10,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
THR (0,0,0) BLK (7,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
THR (0,0,0) BLK (8,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700000400)
...
THR (0,0,0) BLK (49,0,0) TMACRD (0,128,0,0,0) SMEMADDR (0x7f8700008400)
THR (0,0,0) BLK (56,0,0) TMACRD (0,1024,0,0,0) SMEMADDR (0x7f8700008400)
THR (0,0,0) BLK (47,0,0) TMACRD (0,1920,0,0,0) SMEMADDR (0x7f8700008400)
THR (0,0,0) BLK (57,0,0) TMACRD (0,1152,0,0,0) SMEMADDR (0x7f8700008400)
THR (0,0,0) BLK (15,0,0) TMACRD (0,1920,0,0,0) SMEMADDR (0x7f8700008400)
THR (0,0,0) BLK (48,0,0) TMACRD (0,0,0,0,0) SMEMADDR (0x7f8700008400)
...

Different CTAs show the same shared memory address, even though the GPU (RTX 5090) does not support clusters.

After this happens, the kernel hits an illegal instruction error when executing CUTLASS TMA-related operations (cp.async.bulk.tensor.2d, etc.).

So my questions are:

  1. Is it expected that different CTAs print the same shared memory pointer value?
    My impression is that shared memory is scoped to a single CTA, so this result seems wrong.
  2. Could this incorrect shared memory address be the reason for the illegal instruction exception?
    Since the kernel uses TMA instructions, I want to confirm whether incorrect smem addressing could trigger this.
  3. Is there any known issue with smem pointers or smem addressing when debug-printing in CUTLASS kernels?

Any insights would be appreciated. Thanks!

The shared address is a pointer into a window; the window (base) is assigned per CTA. The same pointer value in two CTAs does not refer to the same physical storage.

You can run a fairly simple experiment yourself. Declare a __shared__ variable in a kernel, take the address of that variable, and print it out. It will be the same across all thread blocks. That is expected behavior.

# cat t432.cu
#include <cstdio>

__global__ void k(){

  __shared__ int a;
  printf("%lu\n", (unsigned long long)&a);
}

int main(){

  k<<<4,1>>>();
  cudaDeviceSynchronize();
}
# nvcc -o t432 t432.cu
# compute-sanitizer ./t432
========= COMPUTE-SANITIZER
140118023208960
140118023208960
140118023208960
140118023208960
========= ERROR SUMMARY: 0 errors
#
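If you want to look at the window itself rather than the generic pointer, you can convert the address to a shared state-space offset. Here is a minimal sketch of that idea, using the __cvta_generic_to_shared() intrinsic (the exact offset value printed is implementation-specific):

#include <cstdio>

__global__ void k2(){

  __shared__ int a;
  // Convert the generic-space pointer to an offset within this CTA's
  // shared memory window; every block should report the same small offset.
  unsigned long long offset = __cvta_generic_to_shared(&a);
  printf("block %d: shared window offset %llu\n", blockIdx.x, offset);
}

int main(){

  k2<<<4,1>>>();
  cudaDeviceSynchronize();
}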

Thank you for your reply. If the starting shared memory address is the same for all thread blocks, how does each block get its own shared memory without being affected by other blocks?

  1. Can it be understood as just a virtual address, with the actual physical addresses being different? If so, how can we determine the actual physical address?
  2. Moreover, even if they are supposed to be the same, why does the output I printed appear in two parts, with two different shared memory addresses?

How did you arrive at the conclusion that RTX 5090 does not support clusters? RTX 5090 is a CC 12.0 device, which according to the programming guide supports clusters.

What I meant is that the RTX 5090 does not support cluster sub-byte TMA; my mistake earlier was not stating that clearly.

Thanks for the clarification.

It is no surprise that the shared memory address is the same for all thread blocks. This behavior is not exclusive to shared memory. For example, the same SASS code is executed by all threads and blocks, yet register R42 of thread 1 is different from register R42 of thread 2. The hardware obviously keeps track of which block / warp is currently executing and adjusts the hardware accesses accordingly, but it is not documented how this is achieved.

Without showing code, it will be difficult for others to find out why in your case the addresses appear to be different, and why you get an illegal instruction.

You are assuming a physically unified memory with unique physical addresses.

In practice each memory has address lines and select lines (one or several ANDed ones).

The global memory, the shared memory, the L2 cache, the L1 cache - all are different memories. Even each GDDR RAM chip can be seen as one or several separate memories.

The notion of a physical address space is from the view of a processor or a bus; those physical addresses can in turn be treated as virtual by the electronics behind them.

The shared memory definitely has some offset applied to each access, as several blocks can run on one SM.

The physical address of the shared memory in each SM probably starts at 0?

(but we never know)
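A minimal sketch to see the isolation in practice: each block writes its own id through the identical-looking pointer and always reads back its own value, because the address resolves to that block's own physical shared memory.

#include <cstdio>

__global__ void isolation(){
  __shared__ int a;
  a = blockIdx.x;   // every block writes its own id to the "same" address
  __syncthreads();
  // Each block reads back its own id; the identical generic address
  // resolves to a different physical location per block.
  printf("block %d wrote %d and read back %d\n", blockIdx.x, blockIdx.x, a);
}

int main(){
  isolation<<<8,1>>>();
  cudaDeviceSynchronize();
}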

Out of curiosity, I extended Robert_Crovella’s example to thread block clusters.

#include <cooperative_groups.h>
#include <cstdio>
#include <iostream>

namespace cg = cooperative_groups;

__cluster_dims__(4,1,1)
__global__
void kernel_cluster4(){
    __shared__ int a[1024];

    auto cluster = cg::this_cluster();
    int id = cluster.block_index().x;
    int* id0_a = cluster.map_shared_rank(&a[0], 0);
    int* id1_a = cluster.map_shared_rank(&a[0], 1);
    int* id2_a = cluster.map_shared_rank(&a[0], 2);
    int* id3_a = cluster.map_shared_rank(&a[0], 3);

    printf("block %d, id %d, a %p, id0_a %p, id1_a %p, id2_a %p, id3_a %p, isShared0 %d, isShared1 %d, isShared2 %d, isShared3 %d\n", 
        blockIdx.x, id, &a[0], id0_a, id1_a, id2_a, id3_a,
        __isShared(id0_a), __isShared(id1_a), __isShared(id2_a), __isShared(id3_a));

    cluster.sync();
}

__cluster_dims__(1,1,1)
__global__
void kernel_cluster1(){
    __shared__ int a[1024];

    auto cluster = cg::this_cluster();
    int id = cluster.block_index().x;
    int* id0_a = cluster.map_shared_rank(&a[0], 0);

    printf("block %d, id %d, a %p, id0_a %p\n", 
        blockIdx.x, id, &a[0], id0_a);

    cluster.sync();
}

int main(){
    kernel_cluster1<<<8,1>>>();
    cudaDeviceSynchronize();
    std::cout << "\n";

    kernel_cluster4<<<8,1>>>();
    cudaDeviceSynchronize();
    std::cout << "\n";
} 

Output (sorted)

block 0, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 1, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 2, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 3, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 4, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 5, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 6, id 0, a 0xe93600000400, id0_a 0xe93600000400
block 7, id 0, a 0xe93600000400, id0_a 0xe93600000400




block 0, id 0, a 0xe93600000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 1, isShared1 0, isShared2 0, isShared3 0
block 1, id 1, a 0xe93601000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 1, isShared2 0, isShared3 0
block 2, id 2, a 0xe93602000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 0, isShared2 1, isShared3 0
block 3, id 3, a 0xe93603000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 0, isShared2 0, isShared3 1
block 4, id 0, a 0xe93600000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 1, isShared1 0, isShared2 0, isShared3 0
block 5, id 1, a 0xe93601000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 1, isShared2 0, isShared3 0
block 6, id 2, a 0xe93602000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 0, isShared2 1, isShared3 0
block 7, id 3, a 0xe93603000400, id0_a 0xe93600000400, id1_a 0xe93601000400, id2_a 0xe93602000400, id3_a 0xe93603000400, isShared0 0, isShared1 0, isShared2 0, isShared3 1

Blocks within a cluster have different addresses, but the same block rank in different clusters has the same address. In this run, the mapped generic addresses differ by 0x1000000 per cluster rank.

This error comes from the lower-level CUTLASS code, and I believe it is likely a CUTLASS issue. The specific failing instruction is:

cp.async.bulk.tensor.2d.shared::cta.global.mbarrier::complete_tx::bytes.L2::cache_hint

You can see that there is special handling here for sm120, but my understanding is that this instruction should be supported on an RTX 5090. The address I obtained during debugging matches the description in my question and does not look abnormal. So why does it cause an illegal instruction on this device?

#if defined(CUTE_ARCH_TMA_SM90_ENABLED)
    uint64_t gmem_int_desc = reinterpret_cast<uint64_t>(desc_ptr);
    uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr);
    uint32_t smem_int_ptr  = cast_smem_ptr_to_uint(smem_ptr);
    cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr);
#if defined(CUTE_ARCH_TMA_SM120_ENABLED)
    asm volatile (
      "cp.async.bulk.tensor.2d.shared::cta.global.mbarrier::complete_tx::bytes.L2::cache_hint"
      " [%0], [%1, {%3, %4}], [%2], %5;"
      :
      : "r"(smem_int_ptr), "l"(gmem_int_desc), "r"(smem_int_mbar),
        "r"(crd0), "r"(crd1), "l"(cache_hint)
      : "memory");
#else
    asm volatile (
      "cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes.L2::cache_hint"
      " [%0], [%1, {%3, %4}], [%2], %5;"
      :
      : "r"(smem_int_ptr), "l"(gmem_int_desc), "r"(smem_int_mbar),
        "r"(crd0), "r"(crd1), "l"(cache_hint)
      : "memory");
#endif
#endif  // CUTE_ARCH_TMA_SM90_ENABLED
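One thing I can check is which branch my build actually takes. Here is a small compile-time probe I could drop next to this code (a sketch, assuming the same CUTE_ARCH_TMA_* macros as in the excerpt above):

// Compile-time probe: report which TMA variant this translation unit selects.
#if defined(CUTE_ARCH_TMA_SM120_ENABLED)
  #pragma message("TMA path: shared::cta variant (SM120 branch)")
#elif defined(CUTE_ARCH_TMA_SM90_ENABLED)
  #pragma message("TMA path: shared::cluster variant (SM90 branch)")
#else
  #pragma message("TMA path: no CUTE_ARCH_TMA_* macro defined")
#endif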

If you have time and it’s convenient, I can provide my original CUTLASS code along with the setup and test files for you to examine. Thank you very much.

Thank you for the additional experiment; it really helps. So could my address results be the combined effect of multiple clusters?