Wgmma matrix start address

jiachen17713 · July 18, 2025, 10:02am

In its description of the shared-memory matrix descriptor for the wgmma.mma_async instruction, the NVIDIA PTX ISA 8.8 manual gives little information about how to obtain the “matrix start address”, which is encoded in bits [0,13] of the matrix descriptor. I have 3 questions about the “matrix start address”:
1). Is it in bytes?
2). Which state space is it in: generic, shared, or something else?
3). If it is a byte address in the “shared” state space, then for a cluster of 8 blocks on a Hopper device the largest start address should be at least 7x224x1024, which needs 21 bits. That exceeds the largest address the matrix descriptor can encode, (2^14-1)x16.

Curefab · July 18, 2025, 11:58am

PTX manual:

The shared memory descriptor describes the properties of multiplicand matrix in shared memory including its location in the shared memory of the current CTA.

The following must be 16-byte aligned:
Matrix start address

So probably (not fully covered by documentation) you divide the address by 16.

Thus with 14 bits you can address 256 KiB max. of the current SM (not the shared memory of other SMs in the thread block cluster).
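To make the arithmetic concrete, here is a small host-side sketch. It assumes, as guessed above, that the 14-bit field stores the address in 16-byte units; the helper names are made up for illustration and are not part of any NVIDIA API.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the field is 14 bits wide and counts 16-byte units.
constexpr std::uint32_t kFieldBits = 14;
constexpr std::uint32_t kGranuleBytes = 16;

// Number of bytes that a 14-bit field of 16-byte units can span.
constexpr std::uint32_t addressable_bytes() {
    return (1u << kFieldBits) * kGranuleBytes;
}

// Number of bits needed to represent v (0 for v == 0).
constexpr unsigned bits_needed(std::uint64_t v) {
    unsigned n = 0;
    while (v != 0) {
        ++n;
        v >>= 1;
    }
    return n;
}

// 2^14 units * 16 bytes = 256 KiB, i.e. an 18-bit byte address range.
static_assert(addressable_bytes() == 256u * 1024u, "field spans 256 KiB");
static_assert(bits_needed(addressable_bytes() - 1u) == 18, "18-bit byte addresses");

// The cluster-wide figure from the question, 7x224x1024, needs 21 bits,
// which is why it cannot be a cluster-wide byte address.
static_assert(bits_needed(7ull * 224 * 1024) == 21, "needs 21 bits");
```

The 4 low bits dropped by the division are always zero for a 16-byte-aligned address, so nothing is lost for addresses inside the 256 KiB window.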

jiachen17713 · July 18, 2025, 1:41pm

UPDATE: Through some testing I found that the so-called “matrix start address” corresponds to the shared-space address (in bytes) of the first element of the matrix. Although it may overflow when encoded into the descriptor, this does not cause any ambiguity, since the shared-memory address range of a CTA is within 256 KiB.

Curefab · July 18, 2025, 2:18pm

But 256 KiB needs 18 bits of addressing, and the matrix start address field is only 14 bits.

Nanodeoclus · July 21, 2025, 11:39am

Yes, you divide the shared state space address by 16

As mentioned here:

What you encode in bits 13-0 of the matrix descriptor is:
matrix-descriptor-encode(Matrix start address)
where
matrix-descriptor-encode(x) = (x & 0x3FFFF) >> 4

The same encoding applies to the leading dimension and stride dimension offsets (bits 29-16 and 45-32).
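A minimal host-side sketch of that encoding, in case it helps. The function matrix_descriptor_encode follows the formula above; pack_address_and_offsets is a hypothetical helper name, and it only fills the three fields discussed in this thread, leaving every other descriptor bit at zero.

```cpp
#include <cassert>
#include <cstdint>

// Formula quoted above: keep the low 18 bits, then drop the 4 alignment bits.
constexpr std::uint64_t matrix_descriptor_encode(std::uint64_t x) {
    return (x & 0x3FFFF) >> 4;
}

// Hypothetical helper: places the three encoded values at the bit positions
// named in this thread (13-0, 29-16, 45-32). All other bits stay zero.
constexpr std::uint64_t pack_address_and_offsets(std::uint64_t start_addr,
                                                 std::uint64_t leading_byte_offset,
                                                 std::uint64_t stride_byte_offset) {
    return matrix_descriptor_encode(start_addr)
         | (matrix_descriptor_encode(leading_byte_offset) << 16)
         | (matrix_descriptor_encode(stride_byte_offset) << 32);
}

// A shared-space byte address of 0x400 is stored as 0x40.
static_assert(matrix_descriptor_encode(0x400) == 0x40, "divide by 16");

// Addresses that differ by a multiple of 256 KiB encode to the same value,
// which is harmless as long as a CTA's shared window is within 256 KiB.
static_assert(matrix_descriptor_encode(0x40400) == 0x40, "wraps at 256 KiB");
```

In a kernel, the start_addr argument would be the shared-space address of the first matrix element, for example the value returned by __cvta_generic_to_shared() for a pointer into shared memory. That is my assumption about how to obtain it, based on the test result reported earlier in the thread.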
