Example with wgmma.mma_async

grynet · June 1, 2023, 8:50am

I am currently exploring the wgmma.mma_async instruction and attempting to utilize it with shared memory. I have written a code snippet resembling the one below. However, I am encountering some difficulties when it comes to loading input matrices into shared memory and constructing matrix descriptors with and without swizzling.

Is there any readily available example code that shows how to use wgmma.mma_async?

In the past, during the early days of CUDA, there were often informative blog posts that served as excellent resources for learning about new features.

// load input matrix-a to shared memory
// load input matrix-b to shared memory

wgmma.fence.sync.aligned

wgmma.mma_async.sync.aligned.m64n128k16.f16.f16.f16 ... descriptor_a, descriptor_b, ....

wgmma.commit_group.sync.aligned
wgmma.wait_group.sync.aligned 0

striker159 · June 1, 2023, 9:00am

The descriptor format is explained in the PTX documentation (Parallel Thread Execution ISA, version 8.1).

What are your difficulties constructing the descriptors?

grynet · June 1, 2023, 9:10am

I can't get the correct result matrix. Either the results are wrong, or some threads produce no results at all.

Here are the descriptors I’m using for wgmma.mma_async.sync.aligned.m64n128k16.f16.f16.f16. I initially started without swizzling (0x0). I launch 1 thread block with 128 threads for simplicity.

Descriptor_A: 0x0000010000100040
  start      :  0x0040
  leading_off:  0x0010 (16)
  stride_off :  0x0100 (256)
  base_offset:  0x0
  swizzle    :  0x0 

Descriptor_B: 0x00000010080000c0
  start      :  0x00c0
  leading_off:  0x0800 (2048)
  stride_off :  0x0010 (16)
  base_offset:  0x0
  swizzle    :  0x0

striker159 · June 1, 2023, 9:28am

The descriptors look correct to me. I won’t be able to help you with the PTX code though.
However, you may want to take a look at CUTLASS, which supports wgmma: "CUTLASS 3.0 is now available!" (NVIDIA/cutlass, Discussion #787 on GitHub).

202476410arsmart · November 23, 2024, 10:45am

Agree! Examples are very important! CUTLASS is too difficult to use…
