In the description of shared-memory matrix descriptor for wgmma.mma_async instruction, the NVIDIA PTX ISA 8.8 manual gives little information about how to obtain the “matrix start address”, which is encoded in bits [0,13] of matrix descriptor. I have 3 questions about “matrix start address”:
1). Is it in bytes?
2). Which state space is it in, generic/shared/else?
3). If it is in the “shared” state space, if the unit is “byte”, for a cluster with 8 blocks on a Hopper device, the maximum start address should be no less than 7x224x1024 which (has 21 bits) exceeds the maximum address number ((2^14-1)x16) the matrix descriptor can encode.
PTX manual:
The shared memory descriptor describes the properties of multiplicand matrix in shared memory including its location in the shared memory of the current CTA.
The following must be 16-byte aligned:
Matrix start address
So probably (not fully covered by documentation) you divide the address by 16.
Thus with 14 bits you can address 256 KiB max. of the current SM (not the shared memory of other SMs in the thread block cluster).
UPDATE: Through some test I found that the so called “matrix start address” corresponds to the shared-space address (in bytes) of the first element in matrix. Althrough it may overflow when encoded into descriptor, it would not arise any ambiguity since the shared-memory address range of a CTA is within 256Ki.
But 256 KiB need 18 bits of addressing. And the size of the matrix start is 14 bits.
Yes, you divide the shared state space address by 16
As mentioned here:
what you encode in the matrix descriptor bits 13-0 is:
matrix-descriptor-encode(Matrix start address)
where
matrix-descriptor-encode(x) = (x & 0x3FFFF) >> 4
The same encoding applies to the leading dimension and stride dimension offsets (bits 29-16 and 45-32).