Example with wgmma.mma_async

I am currently exploring the wgmma.mma_async instruction and trying to use it with shared memory, with a code snippet resembling the one below. However, I am running into difficulties loading the input matrices into shared memory and constructing the matrix descriptors, both with and without swizzling.

I was wondering if there are any readily available example codes showcasing the usage of wgmma.mma_async?

In the past, during the early days of CUDA, there were often informative blog posts that served as excellent resources for learning about new features.

// load input matrix-a to shared memory
// load input matrix-b to shared memory

fence.proxy.async;           // make the shared-memory writes visible to the async proxy
wgmma.fence.sync.aligned;    // order prior accesses before the wgmma

wgmma.mma_async.sync.aligned.m64n128k16.f16.f16.f16 ... descriptor_a, descriptor_b, ....

wgmma.commit_group.sync.aligned;  // commit the issued wgmma
wgmma.wait_group.sync.aligned 0;  // wait for it to complete

The descriptor format is explained in the PTX documentation (Parallel Thread Execution ISA 8.1).

What are your difficulties constructing the descriptors?

I’m currently having trouble obtaining the correct result matrix: either the results are incorrect, or some threads produce no results at all.

Here are the descriptors I’m using for wgmma.mma_async.sync.aligned.m64n128k16.f16.f16.f16. I initially started without swizzling (0x0). I launch 1 thread block with 128 threads for simplicity.

Descriptor_A: 0x0000010000100040
  start      :  0x0040
  leading_off:  0x0010 (16)
  stride_off :  0x0100 (256)
  base_offset:  0x0
  swizzle    :  0x0 

Descriptor_B: 0x00000010080000c0
  start      :  0x00c0
  leading_off:  0x0800 (2048)
  stride_off :  0x0010 (16)
  base_offset:  0x0
  swizzle    :  0x0 

The descriptors look correct to me. I won’t be able to help you with the PTX code, though.
However, you may take a look at CUTLASS, which supports wgmma: CUTLASS 3.0 is now available! · NVIDIA/cutlass · Discussion #787 · GitHub