I’ve been reading through Greg’s great explanation of LSU behavior on Ampere architecture in this thread, and I have a couple of follow-up questions.
Question 1: Blackwell Architecture
Does the LSU/MIO architecture described in that post still apply to Blackwell (SM 100)? Or have there been architectural changes to the load/store pipeline in Blackwell?
Question 2: Cross-Warp Wavefront Merging for Shared Memory
I’m currently optimizing a kernel that’s showing significant shared memory bank conflicts, so I want to understand how wavefronts are formed and whether there’s anything special I need to do about them.
From the original post:
The LSU pipeline accepts 1 instruction per cycle. The LSU pipeline can contain 100s of inflight instructions.
Given that the LSU pipeline can have hundreds of in-flight instructions (including LDS/STS from multiple warps), can the LSU merge memory operations from different warps into the same wavefront to maximize utilization of all shared memory banks?
For example: if warp A is accessing banks 0-7 and warp B is accessing banks 8-15, could the hardware combine these into a single wavefront that accesses all 16 banks (or 32 on newer architectures) simultaneously?
Or are wavefronts strictly formed from a single warp’s instruction, meaning bank utilization is limited by each individual warp’s access pattern?
Thanks in advance for any clarification.
cc @Greg
Can you store the data as a linear stream and sort it later?
Or, instead of random reads, create requests, sort them, and then generate and process the results in a two-stage algorithm?
Or increase the read/write size per thread, e.g. instead of accessing a struct of arrays, access an array of structs? This works if several arrays are accessed with the same random index.
Bank conflicts while reading can also be reduced by storing multiple copies of the data.
Bank conflicts while writing can be reduced by restricting which lane (threadIdx.x % 32) may write to which bank.
Bank conflicts for both reading and writing can be reduced by increasing the transaction size to 64 or 128 bits (see the sketch below).
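To make that last suggestion concrete, here is a minimal sketch of moving data through shared memory with 128-bit (float4) transactions instead of four separate 32-bit accesses. The kernel name, tile size, and launch configuration are my own illustrative assumptions, not anything from this thread.

```cuda
// Hypothetical kernel illustrating the 64/128-bit transaction suggestion:
// each thread moves one float4 (128 bits) through shared memory instead of
// four separate 32-bit elements. Assumes a launch with blockDim.x == 256.
__global__ void stage128(const float4 *__restrict__ in, float4 *__restrict__ out)
{
    __shared__ float4 tile[256];                  // 256 * 16 B = 4 KiB

    const int t   = threadIdx.x;
    const int gid = blockIdx.x * blockDim.x + t;

    // 128-bit store per thread: consecutive threads write consecutive
    // 16-byte slots, spreading each warp's data evenly over the 32 banks.
    tile[t] = in[gid];
    __syncthreads();

    // 128-bit load per thread, rotated by one slot just so the read pattern
    // differs from the write pattern.
    out[gid] = tile[(t + 1) % blockDim.x];
}
```

Launched as, e.g., stage128<<<grid, 256>>>(d_in, d_out), each thread issues one wide LDS/STS rather than four narrow ones.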
For shared memory, bank conflicts arise on a per-warp, per-instruction basis. There is no combining of accesses from different warps, nor any combining of different instructions issued to the same warp.
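As a concrete illustration of that per-warp, per-instruction rule (this is just the standard padded-tile transpose example, not something specific to this thread): a strided read within a single warp can hit the same bank 32 times in one instruction, and the usual fix is a one-element padding column.

```cuda
#define TILE 32

// Classic shared memory transpose tile, launched with a (TILE, TILE) block.
// Without the +1 padding column, the column read tile[x][y] below would make
// all 32 threads of a warp address the same bank in one instruction
// (a 32-way conflict). Warps never conflict with each other; their shared
// memory instructions are simply serviced in different cycles.
__global__ void transposeTile(const float *__restrict__ in, float *__restrict__ out)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 pad shifts each row by one bank

    const int x = threadIdx.x;               // lane within the warp
    const int y = threadIdx.y;
    const int W = gridDim.x * TILE;          // input width
    const int H = gridDim.y * TILE;          // input height

    // Row-wise (conflict-free) store into the tile.
    tile[y][x] = in[(blockIdx.y * TILE + y) * W + blockIdx.x * TILE + x];
    __syncthreads();

    // Column-wise read: conflict-free only because of the padding column.
    out[(blockIdx.x * TILE + y) * H + blockIdx.y * TILE + x] = tile[x][y];
}
```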
Just to confirm: are you saying that even though there are multiple (4) warp schedulers, there is no contention/bank conflict between the accesses to shared memory from those concurrent warps? Do instructions from different warps (across the warp schedulers) just get serialized by default?
It is the MIO (Memory Input/Output) pipeline, which handles resources shared between the 4 SM sub-partitions. It is serialized.
The general notion of shared memory performance is 32 bits per bank per clock cycle (per SM). AFAIK that is still correct (&). To the extent that shared memory bandwidth (on a given SM) is consumed in a single clock cycle, whatever is used is used by no more than a single instruction, issued to no more than a single warp.
If two warps (in the same SM) have shared memory needs, it will require at least 2 clock cycles to service them both. If a given warp has 2 or more instructions issued that involve shared memory access, it will require at least 2 clock cycles to service them.
(&) There were some Kepler (cc3.5) variants that had possible deviations from this. There are no other deviations in any of the GPU families I am familiar with, from cc2.x to cc12.x.
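As a quick sanity check on that 32-bits-per-bank figure, here is a back-of-the-envelope calculation; the clock frequency and SM count below are placeholder assumptions for illustration, not numbers from this thread or any particular GPU.

```cuda
#include <cstdio>

int main()
{
    // 32 banks x 4 bytes = 128 bytes of shared memory traffic per clock per SM,
    // consumed by at most one instruction from one warp in that clock.
    const int    banks        = 32;
    const int    bytesPerBank = 4;      // 32 bits
    const double smClockGHz   = 1.5;    // assumed clock, illustration only
    const int    numSMs       = 100;    // assumed SM count, illustration only

    const double perSMBytesPerClock = banks * bytesPerBank;            // 128 B/clk
    const double perSMGBps          = perSMBytesPerClock * smClockGHz; // GB/s per SM
    const double chipGBps           = perSMGBps * numSMs;              // GB/s total

    printf("per SM: %.0f B/clock = %.0f GB/s; whole GPU: %.0f GB/s\n",
           perSMBytesPerClock, perSMGBps, chipGBps);

    // Two warps with shared memory work on the same SM need at least 2 clocks;
    // each warp's own access pattern determines how much of the 128 bytes
    // per clock it actually uses.
    return 0;
}
```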
Thanks @Robert_Crovella - this clears things up for me