Discrepancy in LDS.128 Wavefronts: Does SMEM broadcast across quarter-warps?

Hi, NV experts.

I have a question regarding the shared memory broadcast mechanism for LDS.128 instructions.

Experiment Platform

  • A10, CC 8.6
  • H100, CC 9.0

Problem Description

In my program, there are 2 distinct LDS.128 access patterns within a single warp:

# Access Pattern 1

| Memory Offset | Accessing Thread          |
|---------------|---------------------------|
| + 0 bytes     |  Thread 0 and Thread 16   |
| + 16 bytes    |  (Skipped)                |
| + 32 bytes    |  Thread 1 and Thread 17   |
| + 48 bytes    |  (Skipped)                |
| + 64 bytes    |  Thread 2 and Thread 18   |
| + 80 bytes    |  (Skipped)                |
| ...           |  ...                      |
| + 448 bytes   |  Thread 14 and Thread 30  |
| + 464 bytes   |  (Skipped)                |
| + 480 bytes   |  Thread 15 and Thread 31  |
| + 496 bytes   |  (Skipped)                |

# Access Pattern 2

| Memory Offset | Accessing Thread |
|---------------|------------------|
| + 0 bytes     |  Threads 0~15    |
| + 16 bytes    |  (Skipped)       |
| + 32 bytes    |  Threads 16~31   |
| + 48 bytes    |  (Skipped)       |
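
For reference, the two tables above can be written as per-thread byte offsets. This is only a minimal sketch in Python: the function names are hypothetical, and the formulas are inferred from the tables rather than copied from the actual kernel.

```python
WARP_SIZE = 32
ACCESS_BYTES = 16  # one LDS.128 reads 16 bytes per thread


def pattern1_offset(tid):
    """Pattern 1: threads t and t + 16 read the same 16-byte chunk, 32-byte stride."""
    return (tid % 16) * 32


def pattern2_offset(tid):
    """Pattern 2: threads 0~15 read offset 0, threads 16~31 read offset 32."""
    return (tid // 16) * 32


if __name__ == "__main__":
    # Print the byte offset each thread reads under both patterns.
    for tid in range(WARP_SIZE):
        print(tid, pattern1_offset(tid), pattern2_offset(tid))
```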

Here are my NCU profiling results (same on both platforms):

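| Metric                                         | Access Pattern 1 | Access Pattern 2 |
|------------------------------------------------|------------------|------------------|
| derived__memory_l1_conflicts_shared_nway       | 8                | 2                |
| derived__memory_l1_wavefronts_shared_excessive | Yes              | No (Zero)        |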

As expected, Access Pattern 1 causes bank conflicts and excessive wavefronts, while Access Pattern 2 does not. As far as I know:

  • An LDS.128 instruction is executed one quarter-warp at a time (based on previous forum discussions here and here).
  • Ideally, a contiguous-access LDS.128 takes 4 wavefronts to finish (1 per quarter-warp).
  • An N-way bank conflict means N threads hit different addresses in the same bank, requiring N wavefronts to finish execution.
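
The ideal count in the second bullet follows from simple arithmetic, sketched below in Python. It assumes 32 banks of 4 bytes each, so that one wavefront can deliver at most 128 bytes. The function name is made up for illustration.

```python
NUM_BANKS = 32   # assumption: 32 shared memory banks
BANK_WIDTH = 4   # assumption: each bank is 4 bytes wide
WARP_SIZE = 32


def ideal_wavefronts(bytes_per_thread):
    """Minimum wavefronts for a conflict-free, contiguous warp-wide load."""
    bytes_per_wavefront = NUM_BANKS * BANK_WIDTH      # 128 bytes
    total_bytes = WARP_SIZE * bytes_per_thread
    return -(-total_bytes // bytes_per_wavefront)     # ceiling division


if __name__ == "__main__":
    # 4, 8 and 16 bytes per thread correspond to LDS.32, LDS.64 and LDS.128.
    for width in (4, 8, 16):
        print(width, ideal_wavefronts(width))
```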

Using other metrics, I verified that Access Pattern 1 requires 8 wavefronts in total, while Access Pattern 2 requires only 2 wavefronts. Then I tried to explain these numbers.

With Access Pattern 1, every quarter-warp has a 2-way conflict and requires 2 wavefronts. My assumption is that broadcast does not happen across quarter-warps. For example, Thread 0 and Thread 16 do not share the same fetch from SMEM. So the four quarter-warps require 4*2=8 wavefronts in total.

With Access Pattern 2, every quarter-warp can benefit from broadcast and finish execution in 1 wavefront, so there should be 4*1=4 wavefronts in total. But the measured 2 wavefronts seem to be evidence of broadcast between quarter-warps, which contradicts my assumption.
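
The reasoning in the two paragraphs above can be written down as a toy model. The Python sketch below is not a description of the real hardware. It assumes 32 banks of 4 bytes each, serves threads in groups of `group_size`, lets threads within a group share a fetch when they read the same 4-byte word, and shares nothing between groups. The function name, the `group_size` parameter, and the offset formulas (inferred from the tables) are all hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 32     # assumption: 32 banks
BANK_WIDTH = 4     # assumption: 4 bytes per bank
WARP_SIZE = 32
ACCESS_BYTES = 16  # LDS.128 reads 16 bytes per thread


def wavefronts(offsets, group_size):
    """Toy model: count wavefronts when threads are served in fixed-size groups.

    Inside a group, identical 4-byte words are fetched once (broadcast), and
    distinct words that map to the same bank are serialized. Groups never
    share a fetch with each other.
    """
    total = 0
    for start in range(0, WARP_SIZE, group_size):
        words_per_bank = defaultdict(set)
        for tid in range(start, start + group_size):
            first_word = offsets[tid] // BANK_WIDTH
            for word in range(first_word, first_word + ACCESS_BYTES // BANK_WIDTH):
                words_per_bank[word % NUM_BANKS].add(word)
        # The busiest bank decides how many passes this group needs.
        total += max(len(words) for words in words_per_bank.values())
    return total


# Byte offsets per thread, inferred from the two access pattern tables.
pattern1 = [(tid % 16) * 32 for tid in range(WARP_SIZE)]
pattern2 = [(tid // 16) * 32 for tid in range(WARP_SIZE)]

if __name__ == "__main__":
    print(wavefronts(pattern1, group_size=8))  # quarter-warp model, Pattern 1
    print(wavefronts(pattern2, group_size=8))  # quarter-warp model, Pattern 2
```

With `group_size=8` this model gives 8 wavefronts for Pattern 1 and 4 for Pattern 2, matching the 4*2 and 4*1 estimates above. The measured 2 wavefronts for Pattern 2 is what the quarter-warp assumption fails to explain. Other values of `group_size` can be plugged in to compare alternative hypotheses against the profiler numbers.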

My Questions:

  1. How does the SMEM broadcast mechanism actually work for LDS.128 instructions across quarter/half warps?
  2. Why is the hardware capable of broadcasting across quarter-warps in Pattern 2, but unable (or unwilling) to do so for Thread 0 and Thread 16 in Pattern 1?

Thank you in advance for your insights!
