Why can multi-stage accelerate GEMM?

I saw a diagram explaining multi-stage, but I’m confused. The number of shared memory reads that can be issued simultaneously should be limited by bandwidth, right? In diagram A, two reads are in flight at the same time, while in diagram B, three reads are in flight at the same time. Is this realistic? In other words, with multi-stage the number of buffers increases, but does the read speed for each buffer decrease…?

Shared memory operations go into a pipeline, so the data can arrive some time after you issue the instruction.
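
A minimal sketch of that issue-now, consume-later behavior, using the explicit async-copy primitives from `<cuda_pipeline.h>` (Ampere and newer). The kernel, its name, and the 256-thread assumption are illustrative, not taken from the diagrams in the thread:

```cpp
#include <cuda_pipeline.h>  // __pipeline_memcpy_async and friends (sm_80+)

// Illustrative kernel; assumes exactly 256 threads per block.
__global__ void issue_then_wait(const float* src, float* dst) {
    __shared__ float buf[256];

    // Issue the copy: the instruction retires immediately and the data
    // travels to shared memory in the background.
    __pipeline_memcpy_async(&buf[threadIdx.x], &src[threadIdx.x], sizeof(float));
    __pipeline_commit();

    // Independent work can run here while the copy is in flight.

    __pipeline_wait_prior(0);  // block only now, when the data is actually needed
    __syncthreads();

    dst[threadIdx.x] = buf[threadIdx.x] * 2.0f;
}
```

With a plain ld.shared the same thing happens implicitly: the load is issued, and the warp only stalls when a later instruction actually consumes the destination register.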

When I issue an instruction like ld.shared, is the read speed (bytes/second) a constant for this instruction?

If so, the total speed is determined by how many instructions are issued at the same time.

If not, and I issue multiple instructions, they will share a constant read speed. Then perhaps the most urgent read instruction should simply be issued first, which would weaken the point of multi-stage.

What do you think? Which one is correct?

Yes, there is a maximum bandwidth (read speed) from shared memory. But it does not matter whether the instructions are issued at the same time or not, because the GPU pipelines requests through an internal buffer.

In the charts you showed, you can see that the transactions of the buffers overlap, but they do not start at the same time. In a pipeline the read is split into micro-operations, and the actual read is only one of them. E.g., imagine the actual read always happens 2 cm from the right edge of each load block, with a width of 5 mm; then the reads are never done at the same time. The problem here is not bandwidth, but latency.
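
To make that concrete, here is a minimal two-stage (double-buffered) tile loop. Everything about it is a sketch: the tile size, the kernel name, and the assumption that M, N, K are multiples of the tile. Real multi-stage kernels on Ampere+ typically use cp.async (so the copies bypass registers entirely) and three or more stages:

```cpp
// Hypothetical two-stage GEMM tile loop (names and sizes are illustrative).
// Assumes M, N, K are multiples of TILE and a TILE x TILE thread block.
#include <cuda_runtime.h>

#define TILE 32

__global__ void gemm_two_stage(const float* A, const float* B, float* C,
                               int M, int N, int K) {
    // Two shared-memory buffers per operand: while one stage feeds the math,
    // the loads that fill the other stage are already in flight, so their
    // latency overlaps with computation instead of serializing with it.
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    int numTiles = K / TILE;
    int stage = 0;

    // Prologue: fill stage 0 with the first tile.
    As[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
    Bs[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
    __syncthreads();

    for (int t = 0; t < numTiles; ++t) {
        int next = stage ^ 1;
        // Issue the loads for the NEXT tile before doing math on the current one.
        if (t + 1 < numTiles) {
            int k0 = (t + 1) * TILE;
            As[next][threadIdx.y][threadIdx.x] = A[row * K + k0 + threadIdx.x];
            Bs[next][threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        }
        // Math on the current stage overlaps with the in-flight loads above.
        for (int k = 0; k < TILE; ++k)
            acc += As[stage][threadIdx.y][k] * Bs[stage][k][threadIdx.x];
        __syncthreads();  // next stage fully written; current stage fully read
        stage = next;
    }
    C[row * N + col] = acc;
}
```

Adding more stages does not make each buffer read slower; it just gives the scheduler more in-flight tiles with which to cover the load latency, up to the limits of shared memory capacity and bandwidth.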
