On the Hopper architecture, are TMA and multi-stage pipelining mutually exclusive?

As listed here, why is there no variant that uses both TMA and multi-stage?

In case anyone is wondering, these are part of CUTLASS.

Perhaps because TMA is one specific way of doing the copy, while multi-stage is more flexible?

Well, the multi-stage version does not even use TMA! Why would the two be contradictory? It seems the TMA version currently uses two buffers. In theory it could also be multi-stage; I just have not tested that yet. Maybe more stages are simply not needed? Does the current two-buffer TMA version already make the GEMM compute bound?

Read here about multistage: cutlass/media/docs/efficient_gemm.md at main · NVIDIA/cutlass · GitHub

TMA is a new feature of Hopper. The programmers probably created a specific, optimized kernel variant for it without regard for the multi-stage design.

Whether TMA can be integrated back into multi-stage without losing any of the performance gains is a different question.
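To show why the two ideas are not contradictory in principle, here is a minimal host-side C++ model of a multi-stage (ring-buffered) mainloop. It is a sketch, not CUTLASS code: `Tile`, `LoadFn`, and `run_pipeline` are hypothetical names made up for this post, and the `load_tile` callback merely stands in for whatever mechanism fills a shared-memory stage (for example `cp.async` or TMA). The point is only that the number of stages and the copy mechanism are separate knobs; it says nothing about performance on real hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-ins: a "tile" is one shared-memory stage buffer,
// and LoadFn is whatever copies K-tile `k_tile` into that buffer.
using Tile = std::vector<int>;
using LoadFn = std::function<void(Tile& dst, int k_tile)>;

// Host-side model of a mainloop with `num_stages` ring-buffered stages.
// Returns the sum of all loaded elements as a stand-in for the MMA work.
long run_pipeline(int num_stages, int num_k_tiles,
                  const LoadFn& load_tile, std::size_t tile_elems) {
    assert(num_stages >= 1);
    std::vector<Tile> stages(num_stages, Tile(tile_elems));
    long acc = 0;
    int next_load = 0;

    // Prologue: prefetch up to (num_stages - 1) tiles ahead of the consumer.
    for (; next_load < num_stages - 1 && next_load < num_k_tiles; ++next_load) {
        load_tile(stages[next_load % num_stages], next_load);
    }

    for (int k = 0; k < num_k_tiles; ++k) {
        // Refill the slot that was consumed in the previous iteration
        // (or, with a single stage, load the tile that is consumed next).
        if (next_load < num_k_tiles) {
            load_tile(stages[next_load % num_stages], next_load);
            ++next_load;
        }
        // "Compute" on the current stage.
        for (int v : stages[k % num_stages]) {
            acc += v;
        }
    }
    return acc;
}
```

With `num_stages = 2` this is the classic double-buffered loop; raising the stage count only changes how far ahead the loads run, not how each load is issued. On a GPU the loads would be asynchronous and guarded by barriers, which this sequential model leaves out.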
