Hi, NV experts.
I have a question regarding the shared memory broadcast mechanism for LDS.128 instructions.
Experiment Platform
- A10, CC 8.6
- H100, CC 9.0
Problem Description
In my program, there are 2 distinct LDS.128 access patterns within a single warp:
# Access Pattern 1
| Memory Offset | Accessing Thread |
|---------------|---------------------------|
| + 0 bytes | Thread 0 and Thread 16 |
| + 16 bytes | (Skipped) |
| + 32 bytes | Thread 1 and Thread 17 |
| + 48 bytes | (Skipped) |
| + 64 bytes | Thread 2 and Thread 18 |
| + 80 bytes | (Skipped) |
| ... | ... |
| + 448 bytes | Thread 14 and Thread 30 |
| + 464 bytes | (Skipped) |
| + 480 bytes | Thread 15 and Thread 31 |
| + 496 bytes | (Skipped) |
# Access Pattern 2
| Memory Offset | Accessing Thread |
|---------------|------------------|
| + 0 bytes | Threads 0~15 |
| + 16 bytes | (Skipped) |
| + 32 bytes | Threads 16~31 |
| + 48 bytes | (Skipped) |
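To make the two tables concrete, here is a minimal host-side C++ sketch of the per-thread byte offsets and the banks they land on. The function names are hypothetical, chosen only for illustration. The sketch assumes 32 shared-memory banks of 4 bytes each, and that each thread's LDS.128 reads 16 contiguous, 16-byte-aligned bytes.

```cpp
#include <cassert>

// Assumptions (not stated in the tables above): 32 banks, 4 bytes per bank,
// and every LDS.128 reads one aligned 16-byte chunk per thread.
constexpr int kWarpSize = 32;
constexpr int kNumBanks = 32;
constexpr int kBankWidthBytes = 4;

// Access Pattern 1: threads t and t+16 read the same chunk at offset 32 * (t % 16).
int pattern1_offset(int tid) {
    assert(tid >= 0 && tid < kWarpSize);
    return 32 * (tid % 16);
}

// Access Pattern 2: threads 0..15 read offset 0, threads 16..31 read offset 32.
int pattern2_offset(int tid) {
    assert(tid >= 0 && tid < kWarpSize);
    return 32 * (tid / 16);
}

// First of the four consecutive banks covered by a 16-byte read at byte_offset.
int first_bank(int byte_offset) {
    assert(byte_offset >= 0 && byte_offset % 16 == 0);
    return (byte_offset / kBankWidthBytes) % kNumBanks;
}
```

In Pattern 1, for example, Thread 0 (offset 0) and Thread 4 (offset 128) read different chunks that both start at bank 0, which is where the conflicts within a quarter-warp come from.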
Here are my NCU profile results (same on both platforms):
| Metric | Access Pattern 1 | Access Pattern 2 |
|---|---|---|
| derived__memory_l1_conflicts_shared_nway | 8 | 2 |
| derived__memory_l1_wavefronts_shared_excessive | Yes | No (Zero) |
As expected, Access Pattern 1 raises bank conflicts and excessive wavefronts, while Access Pattern 2 does not. As far as I know:
- An `LDS.128` instruction is executed per quarter-warp (based on previous forum discussions here and here).
- Ideally, a contiguous-access `LDS.128` takes 4 wavefronts to finish (1 per quarter-warp).
- An N-way bank conflict means N threads hit the same bank, requiring N wavefronts to finish execution.
Using other metrics, I verified that Access Pattern 1 requires 8 wavefronts in total, while Access Pattern 2 requires only 2 wavefronts. Then I tried to explain these numbers.
With Access Pattern 1, every quarter-warp has a 2-way conflict and requires 2 wavefronts. My assumption is that broadcast does not happen across quarter-warps. For example, Thread 0 and Thread 16 do not share the same fetch from SMEM. So the four quarter-warps require 4*2=8 wavefronts in total.
With Access Pattern 2, every quarter-warp can benefit from broadcast and finish execution in 1 wavefront. There should be 4*1=4 wavefronts in total. But the measured 2 wavefronts seem to be evidence of broadcast between quarter-warps, which contradicts my assumption.
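The arithmetic behind these two predictions can be written out as a small host-side C++ sketch. This is only a model of my assumption, not a description of how the hardware works. All names are hypothetical. Threads are served in groups of `group_size`, identical addresses within a group are broadcast, distinct 16-byte chunks that start on the same bank are serialized, and nothing is shared across groups. It also assumes 32 banks of 4 bytes each.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>

// Assumed shared-memory geometry: 32 banks, 4 bytes per bank.
constexpr int kWarpSize = 32;
constexpr int kNumBanks = 32;
constexpr int kBankWidthBytes = 4;

// Byte offsets of the two access patterns (one aligned 16-byte read per thread).
int pattern1_offset(int tid) { return 32 * (tid % 16); }
int pattern2_offset(int tid) { return 32 * (tid / 16); }

// Hypothetical model of the assumption in the text: each group of group_size
// threads is served on its own, with broadcast only inside the group.
int predicted_wavefronts(int (*offset_of)(int), int group_size) {
    assert(group_size > 0 && kWarpSize % group_size == 0);
    int total = 0;
    for (int base = 0; base < kWarpSize; base += group_size) {
        // Distinct chunks per starting bank. Aligned 16-byte chunks cover
        // either the same four banks or disjoint ones, so the first bank suffices.
        std::map<int, std::set<int>> chunks_per_bank;
        for (int tid = base; tid < base + group_size; ++tid) {
            int offset = offset_of(tid);
            int bank = (offset / kBankWidthBytes) % kNumBanks;
            chunks_per_bank[bank].insert(offset);
        }
        // The most contended bank decides how many wavefronts this group needs.
        int worst = 0;
        for (const auto& entry : chunks_per_bank) {
            worst = std::max(worst, static_cast<int>(entry.second.size()));
        }
        total += worst;
    }
    return total;
}
```

With `group_size = 8` (quarter-warps), `predicted_wavefronts(pattern1_offset, 8)` gives 8 and `predicted_wavefronts(pattern2_offset, 8)` gives 4, matching the predictions above. The group size is a parameter so that other granularities can be compared against the measured counts.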
My Questions:
- How does the SMEM broadcast mechanism actually work for `LDS.128` instructions across quarter/half warps?
- Why is the hardware capable of broadcasting across quarter-warps in Pattern 2, but unable (or unwilling) to do so for Thread 0 and Thread 16 in Pattern 1?
Thank you in advance for your insights!