Question about efficient usage of wmma

I’ve recently been playing around with wmma. It’s cool to have instruction-level matrix operations, but since the internal fragment layout is undefined, it becomes “inefficient” in some cases.

First question: when I inspect the compiled kernel using Nsight Compute, I found that wmma<16,16,16,half> loads its operands with multiple 32-bit loads. I thought it might be better to load in 128-bit chunks, since 16 (rows) * 16 (cols) * 2 bytes (half) is exactly 32 (one warp) * 16 bytes (the maximum single fetch). So I tried staging the matrix in shared memory using warp-cooperative 128-bit loads, but the resulting performance was somewhat worse. I suspect this is due to the latency of writing and then reading shared memory.
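For reference, this is roughly what I mean by staging with warp-cooperative 128-bit loads (a minimal sketch, assuming the 16x16 half tile is stored contiguously in row-major order, so each of the 32 lanes can copy one 16-byte chunk):

```cuda
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_smem_staged(const half* A)  // A: contiguous 16x16 tile (assumption)
{
    __shared__ alignas(16) half tileA[16 * 16];

    int lane = threadIdx.x % 32;
    // One float4 (16 bytes) per lane: 32 lanes * 16 B = 512 B = 16*16 halfs,
    // so the whole tile is covered by a single 128-bit load/store per thread.
    reinterpret_cast<float4*>(tileA)[lane] =
        reinterpret_cast<const float4*>(A)[lane];
    __syncwarp();

    // The fragment load then reads from shared memory instead of global.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::load_matrix_sync(a_frag, tileA, 16);
    // ... wmma::mma_sync etc.
}
```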

Since threads can exchange data directly within a warp using the warp data-exchange intrinsics, is there any possibility of loading 128 bits per thread and then exchanging data between threads to move each value into the correct position?
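To illustrate the primitive I have in mind (just a sketch of `__shfl_sync`, not a working fragment repack, since the fragment-to-lane mapping is exactly what wmma leaves unspecified):

```cuda
// Each lane reads the register value held by the next lane (wrap-around),
// moving data within the warp without touching shared memory.
__device__ float rotate_within_warp(float v)
{
    int lane = threadIdx.x % 32;
    return __shfl_sync(0xffffffffu, v, (lane + 1) % 32);
}
```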

The second question is also about the internal layout. How can I convert a wmma::accumulator fragment back to a wmma::matrix_a fragment without a round trip through shared memory? I think there should be a more efficient solution.

Thanks!

For questions like this I often suggest that folks use the mma instructions rather than wmma. You’ll currently have to drop to PTX for this, but it removes some of the “opaqueness” associated with wmma operations: the PTX ISA documents the exact fragment-to-lane mapping for each mma shape, which is what you need for both warp-shuffle loading and accumulator-to-operand reuse.

As added information: PTX can be used from C++ programs with inline assembly.
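As a hedged sketch of what that looks like, here is one half-precision m16n8k16 MMA issued via inline PTX. The register counts and operand order follow the PTX ISA description of `mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32` (4 x b32 for A, 2 x b32 for B, 4 x f32 for C/D); how you pack halfs into those registers per lane is defined by the documented layout, not left opaque as in wmma:

```cuda
// d = a * b + c for one m16n8k16 tile, registers distributed across the warp.
// a holds 8 halfs per lane packed into 4 b32 registers; b holds 4 halfs
// packed into 2 b32 registers; c and d are 4 f32 values per lane.
__device__ void mma_m16n8k16(float d[4], const unsigned a[4],
                             const unsigned b[2], const float c[4])
{
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```

Because the layout is known, the accumulator registers produced by one mma can be repacked (with shuffles or conversions as needed) into the A-operand registers of the next, with no shared-memory round trip.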