I am currently exploring the wgmma.mma_async instruction and attempting to use it with shared memory. I have written a code snippet resembling the one below, but I am running into difficulties loading the input matrices into shared memory and constructing the matrix descriptors, both with and without swizzling.
I was wondering if there are any readily available example codes showcasing the usage of wgmma.mma_async with shared memory.
In the past, during the early days of CUDA, there were often informative blog posts that served as excellent resources for learning about new features.
// load input matrix A into shared memory
// load input matrix B into shared memory
wgmma.mma_async.sync.aligned.m64n128k16.f16.f16.f16 ..., descriptor_a, descriptor_b, ...;
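For context, here is roughly the per-thread indexing I have in mind for filling the A tile. This is a hypothetical sketch written as host code for clarity (in a kernel the outer loop would be replaced by threadIdx.x); it assumes row-major f16 data and no swizzling:

```cpp
#include <cstdint>

// Hypothetical sketch: 128 threads cooperatively copy a 64x16 tile of A
// (f16, row-major, no swizzle) from global to shared memory. Each thread
// copies 8 half-precision elements, i.e. one 16-byte chunk. Host-side for
// clarity; in a kernel the tid loop would be threadIdx.x.
constexpr int kM = 64, kK = 16;                 // A tile shape for m64...k16
constexpr int kElemsPerThread = kM * kK / 128;  // 8 halves = 16 bytes

void copy_tile(const uint16_t* gmem_a, int lda, uint16_t* smem_a) {
    for (int tid = 0; tid < 128; ++tid) {       // stands in for threadIdx.x
        int first = tid * kElemsPerThread;      // first element this thread owns
        for (int i = 0; i < kElemsPerThread; ++i) {
            int idx = first + i;
            int row = idx / kK, col = idx % kK;
            smem_a[row * kK + col] = gmem_a[row * lda + col];
        }
    }
}
```

In the real kernel each thread would copy its 16-byte chunk with a single vectorized load/store rather than element by element, but the ownership pattern is the same.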
The descriptor format is explained in the PTX documentation: 1. Introduction — parallel-thread-execution 8.1 documentation
What are your difficulties constructing the descriptors?
I’m having trouble obtaining the correct result matrix: either the results are wrong, or some threads produce no results at all.
Here are the descriptors I’m using for wgmma.mma_async.sync.aligned.m64n128k16.f16.f16.f16. I initially started without swizzling (0x0). For simplicity I launch one thread block of 128 threads (a single warpgroup).
Descriptor A:
  start      : 0x0040
  leading_off: 0x0010 (16)
  stride_off : 0x0100 (256)
  swizzle    : 0x0

Descriptor B:
  start      : 0x00c0
  leading_off: 0x0800 (2048)
  stride_off : 0x0010 (16)
  swizzle    : 0x0
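For reference, this is how I pack the descriptor fields, based on my reading of the matrix-descriptor section of the PTX ISA (the field positions below come from the docs; shown as plain host code, though the same bit logic runs on device):

```cpp
#include <cstdint>

// Sketch of packing the 64-bit wgmma shared-memory matrix descriptor,
// per my reading of the PTX ISA "Matrix Descriptor Format" section:
//   bits  0-13 : matrix start address in shared memory (encoded)
//   bits 16-29 : leading-dimension byte offset         (encoded)
//   bits 32-45 : stride-dimension byte offset          (encoded)
//   bits 49-51 : matrix base offset (0 when not swizzling)
//   bits 62-63 : swizzle mode (0 = none, 1 = 128B, 2 = 64B, 3 = 32B)
// encode() keeps 18 bits and drops the low 4 (16-byte granularity).
inline uint64_t encode(uint64_t x) { return (x & 0x3FFFFULL) >> 4; }

inline uint64_t make_desc(uint32_t smem_byte_addr,
                          uint32_t leading_byte_off,
                          uint32_t stride_byte_off,
                          uint32_t swizzle_mode) {
    uint64_t d = 0;
    d |= encode(smem_byte_addr);
    d |= encode(leading_byte_off) << 16;
    d |= encode(stride_byte_off)  << 32;
    d |= (uint64_t)(swizzle_mode & 0x3) << 62;
    return d;
}
```

In the kernel, smem_byte_addr comes from converting the shared-memory pointer to its shared-state-space address (e.g. via __cvta_generic_to_shared). One thing I am double-checking is whether the start values in my dump are raw byte addresses or already right-shifted by 4; the descriptor field stores the encoded (>>4) form.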
The descriptors look correct to me, though I won’t be able to help you with the PTX code itself. However, you may take a look at CUTLASS, which supports wgmma: CUTLASS 3.0 is now available! · NVIDIA/cutlass · Discussion #787 · GitHub