I’ve been reading through Greg’s great explanation of LSU behavior on Ampere architecture in this thread, and I have a couple of follow-up questions.
Question 1: Blackwell Architecture
Does the LSU/MIO architecture described in that post still apply to Blackwell (SM 100)? Or have there been architectural changes to the load/store pipeline in Blackwell?
Question 2: Cross-Warp Wavefront Merging for Shared Memory
I’m currently optimizing a kernel that’s showing significant shared memory bank conflicts, so I want to understand how wavefronts are formed and whether there’s anything special I need to do about them.
From the original post:
The LSU pipeline accepts 1 instruction per cycle. The LSU pipeline can contain 100s of inflight instructions.
Given that the LSU pipeline can have hundreds of in-flight instructions (including LDS/STS from multiple warps), can the LSU merge memory operations from different warps into the same wavefront to maximize utilization of all shared memory banks?
For example: if warp A is accessing banks 0-7 and warp B is accessing banks 8-15, could the hardware combine these into a single wavefront that accesses all 16 banks (or 32 on newer architectures) simultaneously?
Or are wavefronts strictly formed from a single warp’s instruction, meaning bank utilization is limited by each individual warp’s access pattern?
Thanks in advance for any clarification.
cc @Greg
Can you store the data as a linear stream and sort it later?
Or, instead of random reads, create requests, sort them, and then generate and process the results in a two-stage algorithm?
Or increase the read/write size per thread, e.g. instead of accessing a struct of arrays, access an array of structs? This works if several arrays are accessed with the same random index.
Bank conflicts while reading can also be reduced by storing multiple copies of the data.
Bank conflicts while writing can be reduced by restricting which lane (threadIdx.x % 32) may write to which bank.
Bank conflicts for both reading and writing can be reduced by increasing the transaction size to 64 or 128 bits (see the sketch below).
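To make that last suggestion concrete, here is a minimal sketch of moving data through shared memory with 128-bit (float4) transactions instead of four separate 32-bit accesses. The kernel name, tile size, and launch configuration are my own illustrative assumptions, not anything from this thread.

```cuda
// Hypothetical kernel illustrating the 64/128-bit transaction suggestion:
// each thread moves one float4 (128 bits) through shared memory instead of
// four separate 32-bit elements. Assumes a launch with blockDim.x == 256.
__global__ void stage128(const float4 *__restrict__ in, float4 *__restrict__ out)
{
    __shared__ float4 tile[256];                  // 256 * 16 B = 4 KiB

    const int t   = threadIdx.x;
    const int gid = blockIdx.x * blockDim.x + t;

    // 128-bit store per thread: consecutive threads write consecutive
    // 16-byte slots, spreading each warp's data evenly over the 32 banks.
    tile[t] = in[gid];
    __syncthreads();

    // 128-bit load per thread, rotated by one slot just so the read pattern
    // differs from the write pattern.
    out[gid] = tile[(t + 1) % blockDim.x];
}
```

Launched as, e.g., stage128<<<grid, 256>>>(d_in, d_out), each thread issues one wide LDS/STS rather than four narrow ones.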
For shared memory, bank conflicts arise on a per-warp, per-instruction basis. There is no combining of accesses from different warps, nor any combining of different instructions issued to the same warp.
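As a concrete illustration of that per-warp, per-instruction rule (this is just the standard padded-tile transpose example, not something specific to this thread): a strided read within a single warp can hit the same bank 32 times in one instruction, and the usual fix is a one-element padding column.

```cuda
#define TILE 32

// Classic shared memory transpose tile, launched with a (TILE, TILE) block.
// Without the +1 padding column, the column read tile[x][y] below would make
// all 32 threads of a warp address the same bank in one instruction
// (a 32-way conflict). Warps never conflict with each other; their shared
// memory instructions are simply serviced in different cycles.
__global__ void transposeTile(const float *__restrict__ in, float *__restrict__ out)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 pad shifts each row by one bank

    const int x = threadIdx.x;               // lane within the warp
    const int y = threadIdx.y;
    const int W = gridDim.x * TILE;          // input width
    const int H = gridDim.y * TILE;          // input height

    // Row-wise (conflict-free) store into the tile.
    tile[y][x] = in[(blockIdx.y * TILE + y) * W + blockIdx.x * TILE + x];
    __syncthreads();

    // Column-wise read: conflict-free only because of the padding column.
    out[(blockIdx.x * TILE + y) * H + blockIdx.y * TILE + x] = tile[x][y];
}
```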
Just to confirm: are you saying that even though there are multiple (4) warp schedulers, there is no contention/bank conflict between the accesses to shared memory from those concurrent warps? Do instructions from different warps (across the warp schedulers) just get serialized by default?
It is the MIO (Memory Input/Output) pipeline, which handles resources shared between the 4 SM sub-partitions. It is serialized.
The general notion of shared memory performance is 32 bits per bank per clock cycle (per SM). AFAIK that is still correct (&). To the extent that shared memory bandwidth (on a given SM) is consumed in a single clock cycle, whatever is used is used by no more than a single instruction, issued to no more than a single warp.
If two warps (in the same SM) have shared memory needs, it will require at least 2 clock cycles to service them both. If a given warp has 2 or more instructions issued that involve shared memory access, it will require at least 2 clock cycles to service them.
(&) There were some Kepler (cc3.5) variants that had possible deviations from this. There are no other deviations in any of the GPU families I am familiar with, from cc2.x to cc12.x.
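As a quick sanity check on that 32-bits-per-bank figure, here is a back-of-the-envelope calculation; the clock frequency and SM count below are placeholder assumptions for illustration, not numbers from this thread or any particular GPU.

```cuda
#include <cstdio>

int main()
{
    // 32 banks x 4 bytes = 128 bytes of shared memory traffic per clock per SM,
    // consumed by at most one instruction from one warp in that clock.
    const int    banks        = 32;
    const int    bytesPerBank = 4;      // 32 bits
    const double smClockGHz   = 1.5;    // assumed clock, illustration only
    const int    numSMs       = 100;    // assumed SM count, illustration only

    const double perSMBytesPerClock = banks * bytesPerBank;            // 128 B/clk
    const double perSMGBps          = perSMBytesPerClock * smClockGHz; // GB/s per SM
    const double chipGBps           = perSMGBps * numSMs;              // GB/s total

    printf("per SM: %.0f B/clock = %.0f GB/s; whole GPU: %.0f GB/s\n",
           perSMBytesPerClock, perSMGBps, chipGBps);

    // Two warps with shared memory work on the same SM need at least 2 clocks;
    // each warp's own access pattern determines how much of the 128 bytes
    // per clock it actually uses.
    return 0;
}
```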
Thanks @Robert_Crovella - this clears things up for me