For example, I know
asm volatile("ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
: "=r"(dst.x), "=r"(dst.y), "=r"(dst.z), "=r"(dst.w) : "r"(ptr));
will load four 8x8 matrices of 16-bit elements from shared memory, collectively across the warp. Could I rewrite it in C++ so that it behaves the same as this inline PTX?
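To make the target behavior concrete, here is a minimal host-side sketch of how I understand the instruction to distribute data across the 32 lanes. It is only an emulation in plain C++, not device code. The names `Frag4`, `emulate_ldmatrix_x4`, and `row_offset` are hypothetical, shared memory is modelled as a vector of 16-bit elements, and each lane's `ptr` operand is modelled as an element offset. The assumption is that lanes 8*m .. 8*m+7 supply the eight row addresses of matrix m, and that each lane receives one 32-bit word (two adjacent elements) from each matrix.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kWarpSize = 32;

// Hypothetical stand-in for the four 32-bit destination registers of one lane
// (dst.x, dst.y, dst.z, dst.w in the inline PTX).
struct Frag4 {
    std::uint32_t x, y, z, w;
};

// Host-side emulation (assumption-based sketch) of
// ldmatrix.sync.aligned.m8n8.x4.shared.b16 for a whole warp.
//   smem       : shared memory modelled as 16-bit elements
//   row_offset : per-lane element offset of the 8-element row that lane
//                supplies (stands in for each lane's `ptr` operand)
std::array<Frag4, kWarpSize> emulate_ldmatrix_x4(
    const std::vector<std::uint16_t>& smem,
    const std::array<std::size_t, kWarpSize>& row_offset) {
    std::array<Frag4, kWarpSize> out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        std::uint32_t regs[4];
        for (int m = 0; m < 4; ++m) {
            // Row (lane / 4) of matrix m is addressed by lane 8*m + lane/4.
            const int src_lane = m * 8 + lane / 4;
            const std::size_t base = row_offset[src_lane];
            // Each row is 8 x 16 bits = 16 bytes and must be 16-byte aligned.
            assert(base % 8 == 0 && base + 8 <= smem.size());
            // Four consecutive lanes split one row into four 32-bit words.
            const std::size_t col = 2 * static_cast<std::size_t>(lane % 4);
            const std::uint32_t lo = smem[base + col];
            const std::uint32_t hi = smem[base + col + 1];
            regs[m] = lo | (hi << 16);  // little-endian packing of two b16 values
        }
        out[lane] = Frag4{regs[0], regs[1], regs[2], regs[3]};
    }
    return out;
}
```

If this picture is right, a device-side C++ version would need every lane to read through an address held by a different lane, which is why I am unsure it can be expressed without the PTX instruction.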
I would be glad to have your suggestions.