I’ve recently been playing around with wmma. It’s cool to have instruction-level matrix operations, but since the fragments’ internal structure is undefined, it seems to become “inefficient” in some cases?
First question: when I inspect the compiled kernel using Nsight Compute, I found that wmma<16,16,16,half>
is loading the operands using multiple 32-bit loads. I thought it might be better to load in 128-bit chunks, since 16(rows)*16(cols)*2(bytes per half) is exactly 32(threads per warp)*16(max bytes per single fetch). Therefore, I tried staging the matrix in shared memory using a warp-cooperative 128-bit load. But the resulting performance was somewhat worse. I think this is due to the latency of the extra shared-memory write/read round trip.
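For reference, here is roughly what I tried — a minimal sketch, assuming the tile is contiguous in global memory with a leading dimension of 16 and 16-byte alignment (the kernel and variable names are just mine for illustration):

```cuda
#include <mma.h>
using namespace nvcuda;

// Sketch of the warp-cooperative 128-bit staging attempt.
// 32 lanes * 16 bytes (one float4 each) = 512 bytes = the full 16x16 half tile,
// so one coalesced pass fills shared memory.
__global__ void staged_wmma(const half *a, const half *b, float *c) {
    __shared__ __align__(16) half smem_a[16 * 16];

    int lane = threadIdx.x % 32;
    // One 128-bit load per thread; assumes `a` is 16-byte aligned, lda == 16.
    reinterpret_cast<float4 *>(smem_a)[lane] =
        reinterpret_cast<const float4 *>(a)[lane];
    __syncwarp();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::load_matrix_sync(fa, smem_a, 16);  // the extra smem round trip
    wmma::load_matrix_sync(fb, b, 16);       // direct from global, for contrast
    wmma::fill_fragment(fc, 0.0f);
    wmma::mma_sync(fc, fa, fb, fc);
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```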
Since you can exchange data within a warp directly using the warp shuffle instructions, is there any possibility of loading 128 bits per thread and then exchanging data between threads to put each value in the correct position?
The second question is also related to the internal layout. How can I convert a wmma::accumulator
fragment back to a wmma::matrix_a
fragment without writing to and reading from shared memory? I think there should be a more efficient solution.
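To be concrete, this shared-memory round trip is the only portable conversion I know of, since the per-thread fragment layouts are officially undefined — this is the sketch (a half-precision accumulator, function and parameter names are mine) of what I’d like to avoid:

```cuda
#include <mma.h>
using namespace nvcuda;

// Spill the accumulator to a shared-memory scratch tile, then re-load it
// as a matrix_a fragment. `smem` must be a 16x16 half tile per warp.
__device__ void acc_to_a(
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> &acc,
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> &fa,
    half *smem) {
    wmma::store_matrix_sync(smem, acc, 16, wmma::mem_row_major);
    __syncwarp();  // make the stores visible before re-loading
    wmma::load_matrix_sync(fa, smem, 16);
}
```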
Thanks!