Discrepancy in LDS.128 Wavefronts: Does SMEM broadcast across quarter-warps?

kayzee3327 · June 12, 2026, 1:48pm

Hi, NV experts.

I have a question regarding the shared memory broadcast mechanism for LDS.128 instructions.

Experiment Platform

A10, CC 8.6
H100, CC 9.0

Problem Description

In my program, there are 2 distinct LDS.128 access patterns within a single warp:

# Access Pattern 1

| Memory Offset | Accessing Thread          |
|---------------|---------------------------|
| + 0 bytes     |  Thread 0 and Thread 16   |
| + 16 bytes    |  (Skipped)                |
| + 32 bytes    |  Thread 1 and Thread 17   |
| + 48 bytes    |  (Skipped)                |
| + 64 bytes    |  Thread 2 and Thread 18   |
| + 80 bytes    |  (Skipped)                |
| ...           |  ...                      |
| + 448 bytes   |  Thread 14 and Thread 30  |
| + 464 bytes   |  (Skipped)                |
| + 480 bytes   |  Thread 15 and Thread 31  |
| + 496 bytes   |  (Skipped)                |

# Access Pattern 2

| Memory Offset | Accessing Thread |
|---------------|------------------|
| + 0 bytes     |  Threads 0~15    |
| + 16 bytes    |  (Skipped)       |
| + 32 bytes    |  Threads 16~31   |
| + 48 bytes    |  (Skipped)       |
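
For reference, the two tables can be written as per-thread byte offsets. The following is a minimal sketch in Python; the function names are hypothetical and it only does address arithmetic, nothing here runs on the GPU.

```python
# Illustrative only: hypothetical helper names, host-side arithmetic.
WARP_SIZE = 32

def pattern1_offset(tid):
    # Threads t and t+16 read the same 16-byte chunk; chunks are 32 bytes apart.
    return (tid % 16) * 32

def pattern2_offset(tid):
    # Threads 0-15 read the chunk at offset 0, threads 16-31 the chunk at offset 32.
    return (tid // 16) * 32

for name, fn in (("pattern 1", pattern1_offset), ("pattern 2", pattern2_offset)):
    # One byte offset per thread of the warp, in thread order.
    print(name, [fn(t) for t in range(WARP_SIZE)])
```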

Here are my NCU profiling results (identical on both platforms):

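| Metric                                           | Access Pattern 1 | Access Pattern 2 |
|--------------------------------------------------|------------------|------------------|
| `derived__memory_l1_conflicts_shared_nway`       | 8                | 2                |
| `derived__memory_l1_wavefronts_shared_excessive` | Yes              | No (Zero)        |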

As expected, Access Pattern 1 causes bank conflicts and excessive wavefronts, while Access Pattern 2 does not. As far as I know:

An LDS.128 instruction is executed per quarter-warp (based on previous forum discussions here and here).
Ideally, a contiguous-access LDS.128 takes 4 wavefronts to finish (1 per quarter-warp).
An N-way bank conflict means N threads hit the same bank, requiring N wavefronts to finish execution.
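
The "1 wavefront per quarter-warp" figure for contiguous access follows from the bank layout: 32 banks of 4 bytes each give 128 bytes per pass, which is exactly 8 threads times 16 bytes. A small sketch of that mapping, with `banks_touched` as a hypothetical helper name:

```python
NUM_BANKS = 32   # shared memory banks
BANK_BYTES = 4   # each bank is 4 bytes wide

def banks_touched(byte_offset, access_bytes=16):
    """Banks covered by one thread's access starting at byte_offset."""
    first_word = byte_offset // BANK_BYTES
    return [(first_word + i) % NUM_BANKS for i in range(access_bytes // BANK_BYTES)]

# Contiguous LDS.128: thread t reads bytes [16*t, 16*t + 16).
quarter_warp = range(8)
covered = [b for t in quarter_warp for b in banks_touched(16 * t)]

# True when the 8 threads together cover each of the 32 banks exactly once.
print(sorted(covered) == list(range(NUM_BANKS)))
```

The same helper shows that offsets 0 and 128 both start at bank 0, which is where the conflicts in Access Pattern 1 come from.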

Using other metrics, I verified that Access Pattern 1 requires 8 wavefronts in total, while Access Pattern 2 requires only 2 wavefronts. Then I tried to explain these numbers.

With Access Pattern 1, every quarter-warp has a 2-way conflict and requires 2 wavefronts. My assumption is that broadcast does not happen across quarter-warps. For example, Thread 0 and Thread 16 do not share the same fetch from SMEM. So the 4 quarter-warps require 4*2=8 wavefronts in total.

With Access Pattern 2, every quarter-warp can benefit from broadcast and finish execution in 1 wavefront, so there should be 4*1=4 wavefronts in total. But the measured 2 wavefronts seem to be evidence of broadcast between quarter-warps, which contradicts my assumption.
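
The reasoning in the two paragraphs above can be written down as a small model. This is only a sketch of the assumptions stated in this post (groups of 8 threads, broadcast only within a group, one wavefront per distinct address in the worst bank), not a description of what the hardware actually does; `predicted_wavefronts` and `group_size` are hypothetical names.

```python
from collections import defaultdict

NUM_BANKS, BANK_BYTES, ACCESS_BYTES = 32, 4, 16

def predicted_wavefronts(offsets, group_size=8):
    """Wavefronts under the assumed model: threads are processed in groups of
    group_size, same-address reads within a group are broadcast, and a group
    needs as many wavefronts as its worst bank has distinct 4-byte words."""
    total = 0
    for start in range(0, len(offsets), group_size):
        words_per_bank = defaultdict(set)
        # set(...) merges threads reading the same address (broadcast in-group).
        for off in set(offsets[start:start + group_size]):
            for i in range(ACCESS_BYTES // BANK_BYTES):
                word = off // BANK_BYTES + i
                words_per_bank[word % NUM_BANKS].add(word)
        total += max(len(words) for words in words_per_bank.values())
    return total

pattern1 = [(t % 16) * 32 for t in range(32)]   # Access Pattern 1 offsets
pattern2 = [(t // 16) * 32 for t in range(32)]  # Access Pattern 2 offsets
print(predicted_wavefronts(pattern1), predicted_wavefronts(pattern2))
```

Under these assumptions the sketch yields 8 for Access Pattern 1 and 4 for Access Pattern 2, matching the 4*2 and 4*1 estimates, whereas the measured counts are 8 and 2. The `group_size` parameter is exposed so that other grouping assumptions can be tried against the measurements.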

My Questions:

How does the SMEM broadcast mechanism actually work for LDS.128 instructions across quarter/half warps?
Why is the hardware capable of broadcasting across quarter-warps in Pattern 2, but unable (or unwilling) to do so for Thread 0 and Thread 16 in Pattern 1?

Thank you in advance for your insights!
