Ldmatrix PTX instruction usage for fp64 double types

I am implementing FP64 GEMM using PTX’s MMA instruction ‘mma.sync.aligned.m16n8k8.row.col.f64.f64.f64.f64.’ However, I am having difficulty understanding how to use the ‘ldmatrix’ instruction ‘ldmatrix.sync.aligned.x4.m8n8.shared.b16 {%0, %1, %2, %3}, [%4]’ for doubles. This is because the registers are uint_32, which do not align with double precision.

AFAIK ldmatrix is originally meant for 16-bit values only.
When loading, it distributes consecutive 16-bit values to different threads.

I would not use it for FP64 values. Of course you could cut them up and save them in a way, where ldmatrix puts the parts together again, but why would you?

ldmatrix is not necessary for doing tensor core computations, it is just a convenience function for loading 16-bit values - and I have sometimes used it for that purpose even without tensor core computations afterwards.

The registers are uint_32 in that function call, but actually they are half2 or short2.

With ldmatrix you provide 32 pointers (each thread of the warp provides a different pointer) to arrays of size 8 containing 16-bit values. Each thread reads in total 8 values, 1 each from 8 different arrays. laneid / 8 decides, which elements to read (gives the index) and laneid % 4 decides, which group of 8 arrays from the 32 pointers to arrays to use.
The first group of 8 arrays comes from threads 0, 1, 8, 9, 16, 17, 24, 25; the second group from 2, 3, 10, 11, 18, 19, 26, 27, and so on, …
IIRC if you use the x1 or x2 version, only the 2 or 4 first elements from the 8 elements are read. So some array pointers are not used.
A bit involved, but one can make it work.