How does the cluster shape influence GEMM or other applications?

I tested clusters in CUTLASS. They have some influence on performance, but not much. The only thing I could find is that TMA multicast works within a cluster. Is there any other reason to use clusters, for GEMM or other applications?

What do you mean by "it has some influence"? Influence on what? Compared to what? What did you expect, and what was the result? Do you mean speed? On which platform?


For example, say you are writing CUTLASS code for sm90 and you set cluster_shape=<1, 2, 1>, then change it to <1, 4, 1>. The running time changes. Why?

Or, to put the question differently: for GEMM, what optimizations do clusters enable? Given that CUTLASS does not implement SM-to-SM communication, and NCU shows a DSMEM usage of 0, why does the cluster shape influence performance?
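
One way to reason about the multicast effect mentioned above is to count how many tile fetches a cluster issues per K step. The sketch below is only a rough model, not CUTLASS code. It assumes that the A tile is multicast to all CTAs along the cluster's N dimension and the B tile to all CTAs along its M dimension; the function name `tile_fetches_per_k_step` is made up for illustration. It counts fetches only and says nothing about actual running time, which also depends on how clusters are scheduled onto the GPU.

```python
def tile_fetches_per_k_step(cluster_m, cluster_n, multicast=True):
    """Count A and B tile fetches issued by one cluster per K step.

    Assumption (not taken from the thread): with multicast, CTAs that
    share an M index receive the same A tile from a single fetch, and
    CTAs that share an N index receive the same B tile.
    """
    ctas = cluster_m * cluster_n
    if not multicast:
        # Every CTA fetches its own A tile and its own B tile.
        return 2 * ctas
    # One fetch per distinct A tile (cluster_m of them) plus
    # one fetch per distinct B tile (cluster_n of them).
    return cluster_m + cluster_n


if __name__ == "__main__":
    for cm, cn in [(1, 1), (1, 2), (1, 4)]:
        with_mc = tile_fetches_per_k_step(cm, cn, multicast=True)
        without_mc = tile_fetches_per_k_step(cm, cn, multicast=False)
        print(f"cluster <{cm}, {cn}, 1>: {with_mc} fetches with multicast, "
              f"{without_mc} without")
```

Under this model, <1, 2, 1> needs 3 fetches instead of 4 per K step, and <1, 4, 1> needs 5 instead of 8, so changing the cluster shape changes the load traffic even when the kernel never touches DSMEM explicitly.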