I’ve recently been playing around with wmma. It’s cool to have instruction-level matrix operations, but since the fragments’ internal structure is undefined, it seems to become “inefficient” in some cases?
First question: when I inspect the compiled kernel using Nsight Compute, I found that wmma<16,16,16,half>
is loading the operands using multiple 32-bit loads. I thought it might be better to load in 128-bit chunks, since 16(rows)*16(cols)*2(bytes per half) is exactly 32(threads per warp)*16(max bytes per single fetch). Therefore, I tried staging the matrix in shared memory using a warp-cooperative 128-bit load. But the resulting performance was somewhat worse. I think this is due to the latency of the extra shared-memory write/read round trip.
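For reference, here is roughly what I tried — a minimal sketch, assuming the tile is contiguous in global memory with a leading dimension of 16 and 16-byte alignment (the kernel and variable names are just mine for illustration):

```cuda
#include <mma.h>
using namespace nvcuda;

// Sketch of the warp-cooperative 128-bit staging attempt.
// 32 lanes * 16 bytes (one float4 each) = 512 bytes = the full 16x16 half tile,
// so one coalesced pass fills shared memory.
__global__ void staged_wmma(const half *a, const half *b, float *c) {
    __shared__ __align__(16) half smem_a[16 * 16];

    int lane = threadIdx.x % 32;
    // One 128-bit load per thread; assumes `a` is 16-byte aligned, lda == 16.
    reinterpret_cast<float4 *>(smem_a)[lane] =
        reinterpret_cast<const float4 *>(a)[lane];
    __syncwarp();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::load_matrix_sync(fa, smem_a, 16);  // the extra smem round trip
    wmma::load_matrix_sync(fb, b, 16);       // direct from global, for contrast
    wmma::fill_fragment(fc, 0.0f);
    wmma::mma_sync(fc, fa, fb, fc);
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```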
Since you can exchange data within a warp directly using the warp shuffle instructions, is there any possibility of loading 128 bits per thread and then exchanging data between threads to put each value in the correct position?
The second question is also related to the internal layout. How can I convert a wmma::accumulator
fragment back to a wmma::matrix_a
fragment without writing to and reading from shared memory? I think there should be a more efficient solution.
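To be concrete, this shared-memory round trip is the only portable conversion I know of, since the per-thread fragment layouts are officially undefined — this is the sketch (a half-precision accumulator, function and parameter names are mine) of what I’d like to avoid:

```cuda
#include <mma.h>
using namespace nvcuda;

// Spill the accumulator to a shared-memory scratch tile, then re-load it
// as a matrix_a fragment. `smem` must be a 16x16 half tile per warp.
__device__ void acc_to_a(
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> &acc,
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> &fa,
    half *smem) {
    wmma::store_matrix_sync(smem, acc, 16, wmma::mem_row_major);
    __syncwarp();  // make the stores visible before re-loading
    wmma::load_matrix_sync(fa, smem, 16);
}
```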
Thanks!