I am currently exploring the wgmma.mma_async instruction and attempting to use it with shared memory. I have written a code snippet resembling the one below, but I am running into difficulties loading the input matrices into shared memory and constructing the matrix descriptors, both with and without swizzling.
I was wondering if there are any readily available example codes showcasing the usage of wgmma.mma_async with shared memory.
In the past, during the early days of CUDA, there were often informative blog posts that served as excellent resources for learning about new features.
// load input matrix A into shared memory
// load input matrix B into shared memory
wgmma.mma_async.sync.aligned.m64n128k16.f16.f16.f16 ..., descriptor_a, descriptor_b, ...;
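For context, here is roughly the per-thread indexing I have in mind for filling the A tile. This is a hypothetical sketch written as host code for clarity (in a kernel the outer loop would be replaced by threadIdx.x); it assumes row-major f16 data and no swizzling:

```cpp
#include <cstdint>

// Hypothetical sketch: 128 threads cooperatively copy a 64x16 tile of A
// (f16, row-major, no swizzle) from global to shared memory. Each thread
// copies 8 half-precision elements, i.e. one 16-byte chunk. Host-side for
// clarity; in a kernel the tid loop would be threadIdx.x.
constexpr int kM = 64, kK = 16;                 // A tile shape for m64...k16
constexpr int kElemsPerThread = kM * kK / 128;  // 8 halves = 16 bytes

void copy_tile(const uint16_t* gmem_a, int lda, uint16_t* smem_a) {
    for (int tid = 0; tid < 128; ++tid) {       // stands in for threadIdx.x
        int first = tid * kElemsPerThread;      // first element this thread owns
        for (int i = 0; i < kElemsPerThread; ++i) {
            int idx = first + i;
            int row = idx / kK, col = idx % kK;
            smem_a[row * kK + col] = gmem_a[row * lda + col];
        }
    }
}
```

In the real kernel each thread would copy its 16-byte chunk with a single vectorized load/store rather than element by element, but the ownership pattern is the same.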
The descriptor format is explained in the PTX documentation: 1. Introduction — parallel-thread-execution 8.1 documentation
What are your difficulties constructing the descriptors?
I’m having trouble obtaining the correct result matrix: either the results are wrong, or some threads produce no results at all.
Here are the descriptors I’m using for wgmma.mma_async.sync.aligned.m64n128k16.f16.f16.f16. I initially started without swizzling (0x0). For simplicity I launch one thread block of 128 threads (a single warpgroup).
Descriptor A:
  start      : 0x0040
  leading_off: 0x0010 (16)
  stride_off : 0x0100 (256)
  swizzle    : 0x0

Descriptor B:
  start      : 0x00c0
  leading_off: 0x0800 (2048)
  stride_off : 0x0010 (16)
  swizzle    : 0x0
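For reference, this is how I pack the descriptor fields, based on my reading of the matrix-descriptor section of the PTX ISA (the field positions below come from the docs; shown as plain host code, though the same bit logic runs on device):

```cpp
#include <cstdint>

// Sketch of packing the 64-bit wgmma shared-memory matrix descriptor,
// per my reading of the PTX ISA "Matrix Descriptor Format" section:
//   bits  0-13 : matrix start address in shared memory (encoded)
//   bits 16-29 : leading-dimension byte offset         (encoded)
//   bits 32-45 : stride-dimension byte offset          (encoded)
//   bits 49-51 : matrix base offset (0 when not swizzling)
//   bits 62-63 : swizzle mode (0 = none, 1 = 128B, 2 = 64B, 3 = 32B)
// encode() keeps 18 bits and drops the low 4 (16-byte granularity).
inline uint64_t encode(uint64_t x) { return (x & 0x3FFFFULL) >> 4; }

inline uint64_t make_desc(uint32_t smem_byte_addr,
                          uint32_t leading_byte_off,
                          uint32_t stride_byte_off,
                          uint32_t swizzle_mode) {
    uint64_t d = 0;
    d |= encode(smem_byte_addr);
    d |= encode(leading_byte_off) << 16;
    d |= encode(stride_byte_off)  << 32;
    d |= (uint64_t)(swizzle_mode & 0x3) << 62;
    return d;
}
```

In the kernel, smem_byte_addr comes from converting the shared-memory pointer to its shared-state-space address (e.g. via __cvta_generic_to_shared). One thing I am double-checking is whether the start values in my dump are raw byte addresses or already right-shifted by 4; the descriptor field stores the encoded (>>4) form.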
The descriptors look correct to me, though I won’t be able to help you with the PTX code itself. However, you may take a look at CUTLASS, which supports wgmma: CUTLASS 3.0 is now available! · NVIDIA/cutlass · Discussion #787 · GitHub