What is the functionality of the LD/ST units in an SM?

Take the A100 as an example: an SM is divided into four sub-partitions (SMSPs), each of which has 8 LD/ST units. Usually, though, a memory instruction produces 32 memory accesses in a cycle, one from each thread in a warp. How do the 8 LD/ST units handle 32 memory accesses? If the accesses can be coalesced, do the 8 LD/ST units manage to merge all 32 into one memory request in just one cycle?

@Robert_Crovella Hello, can you help me with this question?

The LSU is a pipelined unit, just like most other functional resources in a GPU SM. That means that although a request is submitted to it in a particular cycle, the request will not necessarily be fully processed in a single cycle.

In general, when an SMSP has fewer functional units than a particular instruction type needs to serve a whole warp at once, we can expect the request to be processed over several cycles.
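To put rough numbers on that, here is a back-of-envelope sketch in Python. It is only a toy model, not a description of real LSU behavior: `issue_cycles` and `sectors_touched` are hypothetical helpers, and the "one lane per unit per cycle" rule is an assumption made for illustration. The second helper counts how many 32-byte memory sectors a warp-wide access touches, which is the usual way to reason about coalescing.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def issue_cycles(units_per_smsp: int, threads: int = WARP_SIZE) -> int:
    """Minimum cycles to pass `threads` lanes through `units_per_smsp` units.

    ASSUMPTION: each unit accepts one lane per cycle. Real LSU pipelines are
    not documented at this level, so treat the result as a lower bound on
    issue time, not as a latency figure.
    """
    return math.ceil(threads / units_per_smsp)


def sectors_touched(base: int, stride: int, access_bytes: int = 4,
                    sector_bytes: int = 32, threads: int = WARP_SIZE) -> int:
    """Count the distinct memory sectors covered by one warp-wide access.

    Lane i accesses `access_bytes` bytes starting at `base + i * stride`.
    """
    sectors = set()
    for lane in range(threads):
        first = base + lane * stride
        last = first + access_bytes - 1
        # An access may straddle a sector boundary, so add every sector it covers.
        sectors.update(range(first // sector_bytes, last // sector_bytes + 1))
    return len(sectors)


if __name__ == "__main__":
    print(issue_cycles(8))          # 8 LD/ST units per SMSP, 32 lanes
    print(sectors_touched(0, 4))    # contiguous 4-byte loads, aligned base
    print(sectors_touched(0, 128))  # 128-byte stride between lanes
```

Under these assumptions, 8 units need at least 4 cycles to take in a full warp. A contiguous, aligned 4-byte-per-thread load covers 128 bytes, i.e. 4 sectors, while a 128-byte stride touches 32 separate sectors. So even a perfectly coalesced access is not collapsed into "one request in one cycle" by this model; it just needs far fewer sectors than an uncoalesced one.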

I won’t be able to give a detailed description of LSU behavior, however. Some additional information is available here.